SlideShare uma empresa Scribd logo
1 de 43
Baixar para ler offline
Introduction Example Regex Other Methods PDFs

BeautifulSoup: Web Scraping with Python
Andrew Peterson

Apr 9, 2013

files available at:
https://github.com/aristotle-tek/BeautifulSoup_pres

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Roadmap

Uses: data types, examples...
Getting Started
downloading files with wget
BeautifulSoup: in depth example - election results table
Additional commands, approaches
PDFminer
(time permitting) additional examples

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Etiquette/ Ethics

Similar rules of etiquette apply as Pablo mentioned:
Limit requests, protect privacy, play nice...

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Data/Page formats on the web

HTML, HTML5 (<!DOCTYPE html>)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Data/Page formats on the web

HTML, HTML5 (<!DOCTYPE html>)
data formats: XML, JSON
PDF
APIs
other languages of the web: css, java, php, asp.net...
(don’t forget existing datasets)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

BeautifulSoup

General purpose, robust, works with broken tags
Parses html and xml, including fixing asymmetric tags, etc.
Returns unicode text strings
Alternatives: lxml (also parses html), Scrapey
Faster alternatives: ElementTree, SGMLParser (custom)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Installation

pip install beautifulsoup4 or
easy_install beautifulsoup4
See: http://www.crummy.com/software/BeautifulSoup/
On installing libraries:
http://docs.python.org/2/install/

BeautifulSoup
Introduction Example Regex Other Methods PDFs

HTML Table basics

<table> Defines a table
<th> Defines a header cell in a table
<tr> Defines a row in a table
<td> Defines a cell in a table

BeautifulSoup
Introduction Example Regex Other Methods PDFs

HTML Tables

BeautifulSoup
Introduction Example Regex Other Methods PDFs

HTML Tables

<h4>Simple
<table>
<tr>
<td>[r1,
<td>[r1,
</tr>
<tr>
<td>[r2,
<td>[r2,
</tr>
</table>

table:</h4>

c1] </td>
c2] </td>

c1]</td>
c2]</td>

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Example: Election data from html table

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Example: Election data from html table

election results spread across hundreds of pages
want to quickly put in useable format (e.g. csv)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Download relevant pages

website might change at any moment
ability to replicate research
limits page requests

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Download relevant pages

I use wget (GNU), which can be called from within python
alternatively cURL may be better for macs, or scrapy

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Download relevant pages

wget: note the --no-parent option!

os.system("wget --convert-links --no-clobber 
--wait=4 
--limit-rate=10K 
-r --no-parent http://www.necliberia.org/results2011/results

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Step one: view page source

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Outline of Our Approach

1
2

identify the county and precinct number
get the table:
identify the correct table
put the rows into a list
for each row, identify cells
use regular expressions to identify the party & lastname

3

write a row to the csv file

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Open a page

soup = BeautifulSoup(html_doc)
Print all: print(soup.prettify())
Print text: print(soup.get_text())

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Navigating the Page Structure

some sites use div, others put everything in tables.

BeautifulSoup
Introduction Example Regex Other Methods PDFs

find all

finds all the Tag and NavigableString objects that match the
criteria you give.
find table rows: find_all("tr")
e.g.:
for link in soup.find_all(’a’):
print(link.get(’href’))

BeautifulSoup
Introduction Example Regex Other Methods PDFs

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Let’s try it out

We’ll run through the code step-by-step

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions

Allow precise and flexible matching of strings
precise: i.e. character-by-character (including spaces, etc)
flexible: specify a set of allowable characters, unknown
quantities
import re

BeautifulSoup
Introduction Example Regex Other Methods PDFs

from xkcd

(Licensed under Creative Commons Attribution-NonCommercial 2.5 License.)
BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: metacharacters

Metacharacters:
|. ^ $ * + ? { } [ ]  | ( )
excape metacharacters with backslash 

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: Character class

brackets [ ] allow matching of any element they contain
[A-Z] matches a capital letter, [0-9] matches a number
[a-z][0-9] matches a lowercase letter followed by a number

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: Repeat

star * matches the previous item 0 or more times
plus + matches the previous item 1 or more times
[A-Za-z]* would match only the first 3 chars of Xpr8r

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: match anything

dot . will match anything but line break characters r n
combined with * or + is very hungry!

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: or, optional

pipe is for ‘or’
‘abc|123’ matches ‘abc’ or ‘123’ but not ‘ab3’
question makes the preceeding item optional: c3?[a-z]+
would match c3po and also cpu

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions: in reverse

parser starts from beginning of string
can tell it to start from the end with $

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions

d, w and s
D, W and S NOT digit (use outside char class)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Regular Expressions

Now let’s see some examples and put this to use to get the
party.

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Basic functions: Getting headers, titles, body

soup.head
soup.title
soup.body

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Basic functions

soup.b
id: soup.find_all(id="link2")
eliminate from the tree: decompose()

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Other Methods: Navigating the Parse Tree

With parent you move up the parse tree. With contents
you move down the tree.
contents is an ordered list of the Tag and NavigableString
objects contained within a page element.
nextSibling and previousSibling: skip to the next or
previous thing on the same level of the parse tree

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Data output

Create simple csv files: import csv
many other possible methods: e.g. use within a pandas
DataFrame (cf Wes McKinney)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Putting it all together

Loop over files
for vote total rows, make the party empty
print each row with the county and precinct number as
columns

BeautifulSoup
Introduction Example Regex Other Methods PDFs

PDFs

Can extract text, looping over 100s or 1,000s of pdfs.
not based on character recognition (OCR)

BeautifulSoup
Introduction Example Regex Other Methods PDFs

pdfminer

There are other packages, but pdfminer is focused more
directly on scraping (rather than creating) pdfs.
Can be executed in a single command, or step-by-step

BeautifulSoup
Introduction Example Regex Other Methods PDFs

pdfminer

BeautifulSoup
Introduction Example Regex Other Methods PDFs

PDFs

We’ll look at just using it within python in a single command,
outputting to a .txt file.
Sample pdfs from the National Security Archive Iraq War:
http://www.gwu.edu/~nsarchiv/NSAEBB/NSAEBB418/

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Performing an action over all files

Often useful to do something over all files in a folder.
One way to do this is with glob:
import glob
for filename in glob.glob(’/filepath/*.pdf’):
print filename
see also an example file with pdfminer

BeautifulSoup
Introduction Example Regex Other Methods PDFs

Additional Examples

(time permitting): newspapers, output to pandas...

BeautifulSoup

Mais conteúdo relacionado

Mais procurados

Javascript built in String Functions
Javascript built in String FunctionsJavascript built in String Functions
Javascript built in String Functions
Avanitrambadiya
 

Mais procurados (20)

Introduction to DOM
Introduction to DOMIntroduction to DOM
Introduction to DOM
 
Python Django tutorial | Getting Started With Django | Web Development With D...
Python Django tutorial | Getting Started With Django | Web Development With D...Python Django tutorial | Getting Started With Django | Web Development With D...
Python Django tutorial | Getting Started With Django | Web Development With D...
 
Web crawler
Web crawlerWeb crawler
Web crawler
 
Django Tutorial | Django Web Development With Python | Django Training and Ce...
Django Tutorial | Django Web Development With Python | Django Training and Ce...Django Tutorial | Django Web Development With Python | Django Training and Ce...
Django Tutorial | Django Web Development With Python | Django Training and Ce...
 
Tutorial on Web Scraping in Python
Tutorial on Web Scraping in PythonTutorial on Web Scraping in Python
Tutorial on Web Scraping in Python
 
WEB Scraping.pptx
WEB Scraping.pptxWEB Scraping.pptx
WEB Scraping.pptx
 
Django for Beginners
Django for BeginnersDjango for Beginners
Django for Beginners
 
Web scraping
Web scrapingWeb scraping
Web scraping
 
Getting started with Web Scraping in Python
Getting started with Web Scraping in PythonGetting started with Web Scraping in Python
Getting started with Web Scraping in Python
 
Web scraping in python
Web scraping in pythonWeb scraping in python
Web scraping in python
 
3.2 javascript regex
3.2 javascript regex3.2 javascript regex
3.2 javascript regex
 
Semantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: IntroductionSemantic Web, Ontology, and Ontology Learning: Introduction
Semantic Web, Ontology, and Ontology Learning: Introduction
 
Javascript built in String Functions
Javascript built in String FunctionsJavascript built in String Functions
Javascript built in String Functions
 
Basic Html Knowledge for students
Basic Html Knowledge for studentsBasic Html Knowledge for students
Basic Html Knowledge for students
 
Django Girls Tutorial
Django Girls TutorialDjango Girls Tutorial
Django Girls Tutorial
 
Introduction to back-end
Introduction to back-endIntroduction to back-end
Introduction to back-end
 
django
djangodjango
django
 
Introduction to Django
Introduction to DjangoIntroduction to Django
Introduction to Django
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
 

Destaque

Parse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupParse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful Soup
Jim Chang
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and Django
Sammy Fung
 
CITG Legal Knowledge & Terminology Certificate
CITG Legal Knowledge & Terminology CertificateCITG Legal Knowledge & Terminology Certificate
CITG Legal Knowledge & Terminology Certificate
Terry Bu
 
CCP's Mishu System Tsai and Dean
CCP's Mishu System Tsai and DeanCCP's Mishu System Tsai and Dean
CCP's Mishu System Tsai and Dean
SAINBAYAR Beejin
 
5 kinesis lightning
5 kinesis lightning5 kinesis lightning
5 kinesis lightning
BigDataCamp
 

Destaque (20)

Python beautiful soup - bs4
Python beautiful soup - bs4Python beautiful soup - bs4
Python beautiful soup - bs4
 
Web Scraping is BS
Web Scraping is BSWeb Scraping is BS
Web Scraping is BS
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with Python
 
Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
Parse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupParse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful Soup
 
Web Scraping Technologies
Web Scraping TechnologiesWeb Scraping Technologies
Web Scraping Technologies
 
Developing effective data scientists
Developing effective data scientistsDeveloping effective data scientists
Developing effective data scientists
 
Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010
 
Bot or Not
Bot or NotBot or Not
Bot or Not
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and Django
 
Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)Python in the Hadoop Ecosystem (Rock Health presentation)
Python in the Hadoop Ecosystem (Rock Health presentation)
 
Web Scraping in Python with Scrapy
Web Scraping in Python with ScrapyWeb Scraping in Python with Scrapy
Web Scraping in Python with Scrapy
 
Caring for Sharring
 Caring for Sharring  Caring for Sharring
Caring for Sharring
 
CITG Legal Knowledge & Terminology Certificate
CITG Legal Knowledge & Terminology CertificateCITG Legal Knowledge & Terminology Certificate
CITG Legal Knowledge & Terminology Certificate
 
TJOLI
TJOLITJOLI
TJOLI
 
Sincere Speech
Sincere SpeechSincere Speech
Sincere Speech
 
Opjo
OpjoOpjo
Opjo
 
CCP's Mishu System Tsai and Dean
CCP's Mishu System Tsai and DeanCCP's Mishu System Tsai and Dean
CCP's Mishu System Tsai and Dean
 
The pharaoh’s curses—the constraint from god
The pharaoh’s curses—the constraint from godThe pharaoh’s curses—the constraint from god
The pharaoh’s curses—the constraint from god
 
5 kinesis lightning
5 kinesis lightning5 kinesis lightning
5 kinesis lightning
 

Semelhante a Beautiful soup

Introducing Modern Perl
Introducing Modern PerlIntroducing Modern Perl
Introducing Modern Perl
Dave Cross
 
PowerShell Core Skills (TechMentor Fall 2011)
PowerShell Core Skills (TechMentor Fall 2011)PowerShell Core Skills (TechMentor Fall 2011)
PowerShell Core Skills (TechMentor Fall 2011)
Concentrated Technology
 

Semelhante a Beautiful soup (20)

Introducing Modern Perl
Introducing Modern PerlIntroducing Modern Perl
Introducing Modern Perl
 
PDF Generation in Rails with Prawn and Prawn-to: John McCaffrey
PDF Generation in Rails with Prawn and Prawn-to: John McCaffreyPDF Generation in Rails with Prawn and Prawn-to: John McCaffrey
PDF Generation in Rails with Prawn and Prawn-to: John McCaffrey
 
NoSQL, Hadoop, Cascading June 2010
NoSQL, Hadoop, Cascading June 2010NoSQL, Hadoop, Cascading June 2010
NoSQL, Hadoop, Cascading June 2010
 
Hadoop and Cascading At AJUG July 2009
Hadoop and Cascading At AJUG July 2009Hadoop and Cascading At AJUG July 2009
Hadoop and Cascading At AJUG July 2009
 
PARADIGM IT.pptx
PARADIGM IT.pptxPARADIGM IT.pptx
PARADIGM IT.pptx
 
How Groovy Helps
How Groovy HelpsHow Groovy Helps
How Groovy Helps
 
Perly Parsing with Regexp::Grammars
Perly Parsing with Regexp::GrammarsPerly Parsing with Regexp::Grammars
Perly Parsing with Regexp::Grammars
 
What we can learn from Rebol?
What we can learn from Rebol?What we can learn from Rebol?
What we can learn from Rebol?
 
MongoDB Aggregation MongoSF May 2011
MongoDB Aggregation MongoSF May 2011MongoDB Aggregation MongoSF May 2011
MongoDB Aggregation MongoSF May 2011
 
Jiscad viz
Jiscad vizJiscad viz
Jiscad viz
 
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINAGetting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
Getting Started with PostGIS geographic database - Lasma Sietinsone, EDINA
 
Getting started with PostGIS geographic database
Getting started with PostGIS geographic databaseGetting started with PostGIS geographic database
Getting started with PostGIS geographic database
 
Sa
SaSa
Sa
 
Building data pipelines
Building data pipelinesBuilding data pipelines
Building data pipelines
 
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
Icsug dev day2014_road to damascus - conversion experience-lotusscript and @f...
 
Perl Teach-In (part 1)
Perl Teach-In (part 1)Perl Teach-In (part 1)
Perl Teach-In (part 1)
 
Processing XML with Java
Processing XML with JavaProcessing XML with Java
Processing XML with Java
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
An Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDBAn Introduction To NoSQL & MongoDB
An Introduction To NoSQL & MongoDB
 
PowerShell Core Skills (TechMentor Fall 2011)
PowerShell Core Skills (TechMentor Fall 2011)PowerShell Core Skills (TechMentor Fall 2011)
PowerShell Core Skills (TechMentor Fall 2011)
 

Mais de mustafa sarac

Mais de mustafa sarac (20)

Uluslararasilasma son
Uluslararasilasma sonUluslararasilasma son
Uluslararasilasma son
 
Real time machine learning proposers day v3
Real time machine learning proposers day v3Real time machine learning proposers day v3
Real time machine learning proposers day v3
 
Latka december digital
Latka december digitalLatka december digital
Latka december digital
 
Axial RC SCX10 AE2 ESC user manual
Axial RC SCX10 AE2 ESC user manualAxial RC SCX10 AE2 ESC user manual
Axial RC SCX10 AE2 ESC user manual
 
Array programming with Numpy
Array programming with NumpyArray programming with Numpy
Array programming with Numpy
 
Math for programmers
Math for programmersMath for programmers
Math for programmers
 
The book of Why
The book of WhyThe book of Why
The book of Why
 
BM sgk meslek kodu
BM sgk meslek koduBM sgk meslek kodu
BM sgk meslek kodu
 
TEGV 2020 Bireysel bagiscilarimiz
TEGV 2020 Bireysel bagiscilarimizTEGV 2020 Bireysel bagiscilarimiz
TEGV 2020 Bireysel bagiscilarimiz
 
How to make and manage a bee hotel?
How to make and manage a bee hotel?How to make and manage a bee hotel?
How to make and manage a bee hotel?
 
Cahit arf makineler dusunebilir mi
Cahit arf makineler dusunebilir miCahit arf makineler dusunebilir mi
Cahit arf makineler dusunebilir mi
 
How did Software Got So Reliable Without Proof?
How did Software Got So Reliable Without Proof?How did Software Got So Reliable Without Proof?
How did Software Got So Reliable Without Proof?
 
Staff Report on Algorithmic Trading in US Capital Markets
Staff Report on Algorithmic Trading in US Capital MarketsStaff Report on Algorithmic Trading in US Capital Markets
Staff Report on Algorithmic Trading in US Capital Markets
 
Yetiskinler icin okuma yazma egitimi
Yetiskinler icin okuma yazma egitimiYetiskinler icin okuma yazma egitimi
Yetiskinler icin okuma yazma egitimi
 
Consumer centric api design v0.4.0
Consumer centric api design v0.4.0Consumer centric api design v0.4.0
Consumer centric api design v0.4.0
 
State of microservices 2020 by tsh
State of microservices 2020 by tshState of microservices 2020 by tsh
State of microservices 2020 by tsh
 
Uber pitch deck 2008
Uber pitch deck 2008Uber pitch deck 2008
Uber pitch deck 2008
 
Wireless solar keyboard k760 quickstart guide
Wireless solar keyboard k760 quickstart guideWireless solar keyboard k760 quickstart guide
Wireless solar keyboard k760 quickstart guide
 
State of Serverless Report 2020
State of Serverless Report 2020State of Serverless Report 2020
State of Serverless Report 2020
 
Dont just roll the dice
Dont just roll the diceDont just roll the dice
Dont just roll the dice
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 

Beautiful soup

  • 1. Introduction Example Regex Other Methods PDFs BeautifulSoup: Web Scraping with Python Andrew Peterson Apr 9, 2013 files available at: https://github.com/aristotle-tek/BeautifulSoup_pres BeautifulSoup
  • 2. Introduction Example Regex Other Methods PDFs Roadmap Uses: data types, examples... Getting Started downloading files with wget BeautifulSoup: in depth example - election results table Additional commands, approaches PDFminer (time permitting) additional examples BeautifulSoup
  • 3. Introduction Example Regex Other Methods PDFs Etiquette/ Ethics Similar rules of etiquette apply as Pablo mentioned: Limit requests, protect privacy, play nice... BeautifulSoup
  • 4. Introduction Example Regex Other Methods PDFs Data/Page formats on the web HTML, HTML5 (<!DOCTYPE html>) BeautifulSoup
  • 5. Introduction Example Regex Other Methods PDFs Data/Page formats on the web HTML, HTML5 (<!DOCTYPE html>) data formats: XML, JSON PDF APIs other languages of the web: css, java, php, asp.net... (don’t forget existing datasets) BeautifulSoup
  • 6. Introduction Example Regex Other Methods PDFs BeautifulSoup General purpose, robust, works with broken tags Parses html and xml, including fixing asymmetric tags, etc. Returns unicode text strings Alternatives: lxml (also parses html), Scrapey Faster alternatives: ElementTree, SGMLParser (custom) BeautifulSoup
  • 7. Introduction Example Regex Other Methods PDFs Installation pip install beautifulsoup4 or easy_install beautifulsoup4 See: http://www.crummy.com/software/BeautifulSoup/ On installing libraries: http://docs.python.org/2/install/ BeautifulSoup
  • 8. Introduction Example Regex Other Methods PDFs HTML Table basics <table> Defines a table <th> Defines a header cell in a table <tr> Defines a row in a table <td> Defines a cell in a table BeautifulSoup
  • 9. Introduction Example Regex Other Methods PDFs HTML Tables BeautifulSoup
  • 10. Introduction Example Regex Other Methods PDFs HTML Tables <h4>Simple <table> <tr> <td>[r1, <td>[r1, </tr> <tr> <td>[r2, <td>[r2, </tr> </table> table:</h4> c1] </td> c2] </td> c1]</td> c2]</td> BeautifulSoup
  • 11. Introduction Example Regex Other Methods PDFs Example: Election data from html table BeautifulSoup
  • 12. Introduction Example Regex Other Methods PDFs Example: Election data from html table election results spread across hundreds of pages want to quickly put in useable format (e.g. csv) BeautifulSoup
  • 13. Introduction Example Regex Other Methods PDFs Download relevant pages website might change at any moment ability to replicate research limits page requests BeautifulSoup
  • 14. Introduction Example Regex Other Methods PDFs Download relevant pages I use wget (GNU), which can be called from within python alternatively cURL may be better for macs, or scrapy BeautifulSoup
  • 15. Introduction Example Regex Other Methods PDFs Download relevant pages wget: note the --no-parent option! os.system("wget --convert-links --no-clobber --wait=4 --limit-rate=10K -r --no-parent http://www.necliberia.org/results2011/results BeautifulSoup
  • 16. Introduction Example Regex Other Methods PDFs Step one: view page source BeautifulSoup
  • 17. Introduction Example Regex Other Methods PDFs Outline of Our Approach 1 2 identify the county and precinct number get the table: identify the correct table put the rows into a list for each row, identify cells use regular expressions to identify the party & lastname 3 write a row to the csv file BeautifulSoup
  • 18. Introduction Example Regex Other Methods PDFs Open a page soup = BeautifulSoup(html_doc) Print all: print(soup.prettify()) Print text: print(soup.get_text()) BeautifulSoup
  • 19. Introduction Example Regex Other Methods PDFs Navigating the Page Structure some sites use div, others put everything in tables. BeautifulSoup
  • 20. Introduction Example Regex Other Methods PDFs find all finds all the Tag and NavigableString objects that match the criteria you give. find table rows: find_all("tr") e.g.: for link in soup.find_all(’a’): print(link.get(’href’)) BeautifulSoup
  • 21. Introduction Example Regex Other Methods PDFs BeautifulSoup
  • 22. Introduction Example Regex Other Methods PDFs Let’s try it out We’ll run through the code step-by-step BeautifulSoup
  • 23. Introduction Example Regex Other Methods PDFs Regular Expressions Allow precise and flexible matching of strings precise: i.e. character-by-character (including spaces, etc) flexible: specify a set of allowable characters, unknown quantities import re BeautifulSoup
  • 24. Introduction Example Regex Other Methods PDFs from xkcd (Licensed under Creative Commons Attribution-NonCommercial 2.5 License.) BeautifulSoup
  • 25. Introduction Example Regex Other Methods PDFs Regular Expressions: metacharacters Metacharacters: |. ^ $ * + ? { } [ ] | ( ) excape metacharacters with backslash BeautifulSoup
  • 26. Introduction Example Regex Other Methods PDFs Regular Expressions: Character class brackets [ ] allow matching of any element they contain [A-Z] matches a capital letter, [0-9] matches a number [a-z][0-9] matches a lowercase letter followed by a number BeautifulSoup
  • 27. Introduction Example Regex Other Methods PDFs Regular Expressions: Repeat star * matches the previous item 0 or more times plus + matches the previous item 1 or more times [A-Za-z]* would match only the first 3 chars of Xpr8r BeautifulSoup
  • 28. Introduction Example Regex Other Methods PDFs Regular Expressions: match anything dot . will match anything but line break characters r n combined with * or + is very hungry! BeautifulSoup
  • 29. Introduction Example Regex Other Methods PDFs Regular Expressions: or, optional pipe is for ‘or’ ‘abc|123’ matches ‘abc’ or ‘123’ but not ‘ab3’ question makes the preceeding item optional: c3?[a-z]+ would match c3po and also cpu BeautifulSoup
  • 30. Introduction Example Regex Other Methods PDFs Regular Expressions: in reverse parser starts from beginning of string can tell it to start from the end with $ BeautifulSoup
  • 31. Introduction Example Regex Other Methods PDFs Regular Expressions d, w and s D, W and S NOT digit (use outside char class) BeautifulSoup
  • 32. Introduction Example Regex Other Methods PDFs Regular Expressions Now let’s see some examples and put this to use to get the party. BeautifulSoup
  • 33. Introduction Example Regex Other Methods PDFs Basic functions: Getting headers, titles, body soup.head soup.title soup.body BeautifulSoup
  • 34. Introduction Example Regex Other Methods PDFs Basic functions soup.b id: soup.find_all(id="link2") eliminate from the tree: decompose() BeautifulSoup
  • 35. Introduction Example Regex Other Methods PDFs Other Methods: Navigating the Parse Tree With parent you move up the parse tree. With contents you move down the tree. contents is an ordered list of the Tag and NavigableString objects contained within a page element. nextSibling and previousSibling: skip to the next or previous thing on the same level of the parse tree BeautifulSoup
  • 36. Introduction Example Regex Other Methods PDFs Data output Create simple csv files: import csv many other possible methods: e.g. use within a pandas DataFrame (cf Wes McKinney) BeautifulSoup
  • 37. Introduction Example Regex Other Methods PDFs Putting it all together Loop over files for vote total rows, make the party empty print each row with the county and precinct number as columns BeautifulSoup
  • 38. Introduction Example Regex Other Methods PDFs PDFs Can extract text, looping over 100s or 1,000s of pdfs. not based on character recognition (OCR) BeautifulSoup
  • 39. Introduction Example Regex Other Methods PDFs pdfminer There are other packages, but pdfminer is focused more directly on scraping (rather than creating) pdfs. Can be executed in a single command, or step-by-step BeautifulSoup
  • 40. Introduction Example Regex Other Methods PDFs pdfminer BeautifulSoup
  • 41. Introduction Example Regex Other Methods PDFs PDFs We’ll look at just using it within python in a single command, outputting to a .txt file. Sample pdfs from the National Security Archive Iraq War: http://www.gwu.edu/~nsarchiv/NSAEBB/NSAEBB418/ BeautifulSoup
  • 42. Introduction Example Regex Other Methods PDFs Performing an action over all files Often useful to do something over all files in a folder. One way to do this is with glob: import glob for filename in glob.glob(’/filepath/*.pdf’): print filename see also an example file with pdfminer BeautifulSoup
  • 43. Introduction Example Regex Other Methods PDFs Additional Examples (time permitting): newspapers, output to pandas... BeautifulSoup