                 Overview of Python web scraping tools

                                   Maik Röder
                         Barcelona Python Meetup Group
                                   17.05.2012




Friday, May 18, 2012
Data Scraping

                       • Automated Process
                        • Explore and download pages
                        • Grab content
                        • Store in a database or in a text file
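
The final step, storing what was grabbed, can be sketched with the standard library alone. This is a minimal Python 3 sketch and not part of the talk: the table name `observations`, its column names, and the file name `observations.csv` are assumptions made for illustration. The single record reuses the Barcelona airport reading (23) that the later slides scrape.

```python
import csv
import os
import sqlite3
import tempfile

# One scraped record: (airport code, date, maximum temperature).
# The value 23 is the reading shown in the "Find the node" slide.
records = [("BCN", "2007-05-17", 23)]

# Store in a database (in-memory SQLite; table and column names are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (airport TEXT, day TEXT, max_temp INTEGER)")
conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", records)
conn.commit()
stored = conn.execute("SELECT airport, day, max_temp FROM observations").fetchall()

# Store in a text file (CSV in a temporary directory; the file name is hypothetical).
csv_path = os.path.join(tempfile.mkdtemp(), "observations.csv")
with open(csv_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["airport", "day", "max_temp"])
    writer.writerows(records)
```

A database tends to pay off once repeated runs must avoid duplicates or be queried; a flat file is often enough for a one-off scrape.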

urlparse

                       • Manipulate URL strings
                         urlparse.urlparse()
                         urlparse.urljoin()
                         urlparse.urlunparse()
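
A short sketch of the three calls may help. It is written for Python 3, where the same functions live in `urllib.parse` rather than in the Python 2 `urlparse` module named on the slide. The relative link and the replacement airport code `MAD` are made up for illustration.

```python
from urllib.parse import urljoin, urlparse, urlunparse

url = "http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html"

# Split the URL into its six components (scheme, netloc, path, ...).
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path)

# Resolve a relative link against the page it was found on.
next_day = urljoin(url, "../18/DailyHistory.html")
print(next_day)

# Swap one component and reassemble the URL (hypothetical airport code).
other_airport = urlunparse(parts._replace(path=parts.path.replace("BCN", "MAD")))
print(other_airport)
```

`urljoin()` is the call a crawler typically needs most, because links found in a page are often relative to that page.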




urllib

                   • Download data through different protocols
                   • HTTP, FTP, ...
                       urllib.parse()
                       urllib.urlopen()
                       urllib.urlretrieve()
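
The two download calls can be tried without a network connection. The sketch below is an assumption-laden illustration, not the talk's code: it targets Python 3, where `urlopen()` and `urlretrieve()` live in `urllib.request`, and it fetches a throwaway local page through the `file:` scheme. The helper name `fetch` and the file names are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen, urlretrieve


def fetch(url):
    """Return the raw bytes behind url (works for http:, ftp: and file: URLs)."""
    with urlopen(url) as response:
        return response.read()


# Offline stand-in for a web page, so the sketch runs anywhere.
tmpdir = tempfile.mkdtemp()
page = Path(tmpdir) / "page.html"
page.write_text("<html><body><a href='/history/'>History</a></body></html>",
                encoding="utf-8")
url = page.as_uri()

# urlopen() gives a file-like object; read() returns the body as bytes.
body = fetch(url)

# urlretrieve() copies the resource straight to a local file.
copy_path, headers = urlretrieve(url, os.path.join(tmpdir, "copy.html"))
```

Replacing `url` with an `http://` address should exercise the same code path against a real site.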



Scrape a web site
                       • Example: http://www.wunderground.com/




Preparation
          >>> from StringIO import StringIO
          >>> from urllib2 import urlopen
          >>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
          >>> p = f.read()
          >>> d = StringIO(p)
          >>> f.close()


BeautifulSoup

                       • HTML/XML parser
                       • designed for quick turnaround projects like
                         screen-scraping
                       • http://www.crummy.com/software/BeautifulSoup



BeautifulSoup

          from BeautifulSoup import *
          # href=True skips anchors without an href, which would raise KeyError below
          a = BeautifulSoup(d).findAll('a', href=True)
          [x['href'] for x in a]




Faster BeautifulSoup
         from BeautifulSoup import *
         p = SoupStrainer('a', href=True)  # parse only anchors that have an href
         a = BeautifulSoup(d, parseOnlyThese=p)
         [x['href'] for x in a]




Inspect the Element
                       • Inspect the Maximum temperature




Find the node
   >>> from BeautifulSoup import BeautifulSoup
   >>> soup = BeautifulSoup(d)
   >>> attrs = {'class':'nobr'}
   >>> nobrs = soup.findAll(attrs=attrs)
   >>> temperature = nobrs[3].span.string
   >>> print temperature
   23


htmllib.HTMLParser


                       • Interesting only for historical reasons
                       • based on sgmllib
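
For readers on Python 3, where `htmllib` and `sgmllib` no longer exist, the standard library's event-driven parser is `html.parser.HTMLParser`. The sketch below is an illustration and not from the talk: it collects `href` values from anchors, the same task the BeautifulSoup slides solve. The class name `LinkCollector` and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Hypothetical sample page; the middle anchor has no href and is skipped.
sample = ('<p><a href="/history/">History</a> '
          '<a name="top">Top</a> '
          '<a href="/maps/">Maps</a></p>')

collector = LinkCollector()
collector.feed(sample)
collector.close()
print(collector.links)
```

Callback-style parsing like this builds no tree, which keeps memory use low but makes queries such as "the fourth element with class nobr" more work than with a tree-based parser.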


html5lib
  • Using the custom simpletree format
    • a built-in DOM-ish tree type (pythonic idioms)
          from html5lib import parse
          from html5lib import treebuilders
          e = treebuilders.simpletree.Element
          i = parse(d)
          # walk the parsed tree (i), not the input (d); compare with ==
          a = [x for x in i if isinstance(x, e)
               and x.name == 'a']
          [x.attributes['href'] for x in a]


lxml
            • Library for processing XML and HTML
            • Based on C libraries
              sudo aptitude install libxml2-dev
              sudo aptitude install libxslt-dev

            • Extends the ElementTree API
             • e.g. with XPath


lxml

                       from lxml import etree
                       t = etree.parse('t.xml')
                       for node in t.xpath('//a'):
                           node.tag
                           node.get('href')
                           node.items()
                           node.text
                           node.getparent()
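
Because lxml extends the ElementTree API, most of the node attributes above can be tried with the standard library's `xml.etree.ElementTree` when lxml is not installed. This is a stand-in sketch under stated assumptions, not lxml itself: the standard library supports only a limited XPath subset through `findall()` and has no `getparent()`. The sample document is hypothetical and the code targets Python 3.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample document.
document = """<html><body>
<p>Intro <a href="/history/">History</a></p>
<a href="/maps/" class="nav">Maps</a>
</body></html>"""

root = ET.fromstring(document)

# './/a' is the ElementTree spelling of the XPath '//a', relative to the root.
found = []
for node in root.findall(".//a"):
    found.append((node.tag, node.get("href"), node.items(), node.text))

for tag, href, items, text in found:
    print(tag, href, items, text)
```

Real-world HTML is rarely well-formed XML, which is one reason to prefer lxml's HTML parser or BeautifulSoup for scraped pages.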



twill
                       • Simple
                       • No JavaScript
                       • http://twill.idyll.org
                       • Some more interesting concepts
                        • Pages, Scenarios
                        • State Machines

twill

                       • Commonly used methods:
                         go()
                         code()
                         show()
                         showforms()
                         formvalue() (or fv())
                         submit()



Twill

        >>> from twill import commands as twill
        >>> from twill import get_browser
        >>> twill.go('http://www.google.com')
        >>> twill.showforms()
        >>> twill.formvalue(1, 'q', 'Python')
        >>> twill.showforms()
        >>> twill.submit()
        >>> get_browser().get_html()

Twill - acknowledge_equiv_refresh
                >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
                ...
                twill.errors.TwillException: infinite refresh loop discovered; aborting.
                Try turning off acknowledge_equiv_refresh...
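
The option name appears to refer to `<meta http-equiv="refresh">` tags, which ask a browser to load a URL after a delay; a client that follows them can loop when a page keeps refreshing. Under that assumption, the Python 3 sketch below shows one way to check whether a page carries such a tag before pointing a stateful browser at it. The class name `RefreshFinder` and the sample markup are hypothetical, and this is not how twill itself detects the loop.

```python
from html.parser import HTMLParser


class RefreshFinder(HTMLParser):
    """Record the content of any <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refresh_targets = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # The attribute value is case-insensitive in HTML, so normalise it.
        if (attributes.get("http-equiv") or "").lower() == "refresh":
            self.refresh_targets.append(attributes.get("content"))


# Hypothetical page that immediately refreshes to /history/.
sample = ('<html><head><meta http-equiv="refresh" content="0; url=/history/">'
          '</head><body></body></html>')

finder = RefreshFinder()
finder.feed(sample)
finder.close()
print(finder.refresh_targets)
```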


Twill
       >>> twill.config("acknowledge_equiv_refresh", "false")
       >>> twill.go("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
       ==> at http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html
       'http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html'


mechanize
                       • Stateful programmatic web browsing
                       • navigation history
                       • HTML form state
                       • cookies
                       • ftp:, http: and file: URL schemes
                       • redirections
                       • proxies
                       • Basic and Digest HTTP authentication
mechanize - robots.txt
            >>> import mechanize
            >>> browser = mechanize.Browser()
            >>> browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
            mechanize._response.httperror_seek_wrapper: HTTP Error 403:
            request disallowed by robots.txt


mechanize - robots.txt
      • Do not handle robots.txt
        browser.set_handle_robots(False)

      • Do not handle equiv
        browser.set_handle_equiv(False)
        browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
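
Before switching robots.txt handling off, it can be worth checking what the file actually disallows. The Python 3 sketch below uses the standard library's `urllib.robotparser` for that. The rules, the host `www.example.com`, and the user agent `example-scraper` are all hypothetical; they are not the real robots.txt of the site used in the talk.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that blocks everything under /history/.
rules = """\
User-agent: *
Disallow: /history/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch() answers: may this user agent request this URL?
history_ok = parser.can_fetch(
    "example-scraper",
    "http://www.example.com/history/airport/BCN/2007/5/17/DailyHistory.html",
)
index_ok = parser.can_fetch("example-scraper", "http://www.example.com/index.html")
print(history_ok, index_ok)
```

For a live site, `set_url()` followed by `read()` fetches the real file instead of parsing a string.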


Selenium


                       • http://seleniumhq.org
                       • Support for JavaScript


Selenium

          from selenium import webdriver
          from selenium.common.exceptions import NoSuchElementException
          from selenium.webdriver.common.keys import Keys
          import time



Selenium
          >>> browser = webdriver.Firefox()
          >>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
          >>> a = browser.find_element_by_xpath("(//span[contains(@class,'nobr')])[position()=2]/span").text
          >>> browser.close()
          >>> print a
          23
PhantomJS


                       • http://www.phantomjs.org/



