4. urllib
• Download data through different protocols
• HTTP, FTP, ...
urlparse.urlparse()
urllib.urlopen()
urllib.urlretrieve()
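For orientation, in Python 3 these helpers moved into urllib.request and urllib.parse; a small sketch of the parsing side (Python 3 names, illustrative):

```python
from urllib.parse import urlparse

# Split a URL into components; urllib.request.urlopen() or
# urlretrieve() would then fetch it over HTTP or FTP.
u = urlparse('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
print(u.scheme)  # http
print(u.netloc)  # www.wunderground.com
```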
Friday, May 18, 2012
5. Scrape a web site
• Example: http://www.wunderground.com/
6. Preparation
>>> from StringIO import StringIO
>>> from urllib2 import urlopen
>>> f = urlopen('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
>>> p = f.read()
>>> d = StringIO(p)
>>> f.close()
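Under Python 3 the same preparation reads roughly as follows: urlopen moved to urllib.request and returns bytes, so io.BytesIO replaces StringIO. The page content below is a stand-in rather than a live fetch:

```python
from io import BytesIO

# In Python 3: from urllib.request import urlopen
# p = urlopen(url).read()
p = b'<html><body>stand-in page</body></html>'  # placeholder for the fetched HTML
d = BytesIO(p)  # file-like object that parsers can consume
print(d.read() == p)  # True
```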
7. BeautifulSoup
• HTML/XML parser
• designed for quick turnaround projects like screen-scraping
• http://www.crummy.com/software/BeautifulSoup
8. BeautifulSoup
from BeautifulSoup import BeautifulSoup
a = BeautifulSoup(d).findAll('a')
[x['href'] for x in a]
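With today's bs4 package (BeautifulSoup 4, an assumption about your environment) the same link extraction looks like this:

```python
from bs4 import BeautifulSoup

html = '<p><a href="/a">one</a> <a href="/b">two</a></p>'  # stand-in for the page
soup = BeautifulSoup(html, 'html.parser')
# href=True skips <a> tags that have no href attribute
links = [x['href'] for x in soup.find_all('a', href=True)]
print(links)  # ['/a', '/b']
```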
9. Faster BeautifulSoup
from BeautifulSoup import BeautifulSoup, SoupStrainer
p = SoupStrainer('a')
a = BeautifulSoup(d, parseOnlyThese=p)
[x['href'] for x in a]
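In bs4 the strainer survives with a renamed keyword argument (parse_only instead of parseOnlyThese); a sketch, again on a stand-in page:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<p><a href="/a">one</a><b>skip</b><a href="/b">two</a></p>'
only_links = SoupStrainer('a')  # build the tree from <a> tags only
soup = BeautifulSoup(html, 'html.parser', parse_only=only_links)
links = [x.get('href') for x in soup.find_all('a')]
print(links)  # ['/a', '/b']
```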
11. Find the node
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(d)
>>> attrs = {'class':'nobr'}
>>> nobrs = soup.findAll(attrs=attrs)
>>> temperature = nobrs[3].span.string
>>> print temperature
23
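The attrs-based lookup carries over to bs4 unchanged; a self-contained sketch on a snippet shaped like the weather page (the markup here is an illustration, not the real page):

```python
from bs4 import BeautifulSoup

html = '<span class="nobr"><span>23</span></span>'  # stand-in snippet
soup = BeautifulSoup(html, 'html.parser')
nobrs = soup.find_all(attrs={'class': 'nobr'})
temperature = nobrs[0].span.string  # .span finds the first nested <span>
print(temperature)  # 23
```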
12. htmllib.HTMLParser
• Interesting only for historical reasons
• based on sgmllib
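htmllib and sgmllib were later removed from the standard library, but the same event-driven style survives in html.parser; a minimal link collector in that style (Python 3, illustrative):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from <a> tags as the parser emits events."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.links.extend(v for k, v in attrs if k == 'href')

c = LinkCollector()
c.feed('<p><a href="/a">one</a><a href="/b">two</a></p>')
print(c.links)  # ['/a', '/b']
```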
13. html5lib
• Using the custom simpletree format
• a built-in DOM-ish tree type (pythonic idioms)
from html5lib import parse
from html5lib import treebuilders
e = treebuilders.simpletree.Element
i = parse(d)
a = [x for x in i if isinstance(x, e)
     and x.name == 'a']
[x.attributes['href'] for x in a]
14. lxml
• Library for processing XML and HTML
• Based on C libraries; install the headers:
sudo aptitude install libxml2-dev
sudo aptitude install libxslt-dev
• Extends the ElementTree API
• e.g. with XPath
15. lxml
from lxml import etree
t = etree.parse('t.xml')
for node in t.xpath('//a'):
    node.tag
    node.get('href')
    node.items()
    node.text
    node.getparent()
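Since lxml extends the ElementTree API, the loop above can be mimicked (minus getparent() and full XPath) with the standard library alone; a self-contained sketch:

```python
import xml.etree.ElementTree as etree
from io import StringIO

t = etree.parse(StringIO('<root><a href="/x">link</a></root>'))
nodes = t.findall('.//a')  # the stdlib supports a limited XPath subset
for node in nodes:
    print(node.tag, node.get('href'), node.text)
```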
16. twill
• Simple
• No JavaScript
• http://twill.idyll.org
• Some more interesting concepts
• Pages, Scenarios
• State Machines
17. twill
• Commonly used methods:
go()
code()
show()
showforms()
formvalue() (or fv())
submit()
18. Twill
>>> from twill import commands as twill
>>> from twill import get_browser
>>> twill.go('http://www.google.com')
>>> twill.showforms()
>>> twill.formvalue(1, 'q', 'Python')
>>> twill.showforms()
>>> twill.submit()
>>> get_browser().get_html()
21. mechanize
• Stateful programmatic web browsing
• navigation history
• HTML form state
• cookies
• ftp:, http: and file: URL schemes
• redirections
• proxies
• Basic and Digest HTTP authentication
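The cookie and redirect handling mechanize provides can be approximated with the standard library's handler stack (Python 3 names shown; no request is actually sent here):

```python
from http.cookiejar import CookieJar
from urllib.request import build_opener, HTTPCookieProcessor

jar = CookieJar()  # shared cookie store, like a browser's
opener = build_opener(HTTPCookieProcessor(jar))  # follows redirects by default
# opener.open('http://www.example.com/') would record any Set-Cookie headers in jar
print(len(jar))  # 0: nothing fetched yet
```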
23. mechanize - robots.txt
• Do not handle robots.txt
browser.set_handle_robots(False)
• Do not handle equiv
browser.set_handle_equiv(False)
browser.open('http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html')
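set_handle_robots(False) skips exactly the check that the stdlib robot-exclusion parser performs; for comparison (Python 3 module path, illustrative rules):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Normally these lines would come from the site's /robots.txt
rp.parse(['User-agent: *', 'Disallow: /private/'])
print(rp.can_fetch('*', 'http://example.com/public/page.html'))   # True
print(rp.can_fetch('*', 'http://example.com/private/page.html'))  # False
```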
24. Selenium
• http://seleniumhq.org
• Support for JavaScript
25. Selenium
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
26. Selenium
>>> browser = webdriver.Firefox()
>>> browser.get("http://www.wunderground.com/history/airport/BCN/2007/5/17/DailyHistory.html")
>>> a = browser.find_element_by_xpath(
...     "(//span[contains(@class,'nobr')])[position()=2]/span").text
>>> browser.close()
>>> print a
23
27. PhantomJS
• http://www.phantomjs.org/