Parse The Web Using Python+Beautiful Soup

Parse the web
using Python + Beautiful Soup

at ncucc
cwebb(dot)tw(at)gmail(dot)com

Agenda

• Python
• Beautiful Soup

Parse the web?
but how?

Solutions

• C++
• Java
• Perl
• Python
• Others?

Solutions (Cont.)

• Regular expression
• Parser
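The two approaches listed above can be compared on a tiny page. The following is only a sketch using Python 3's standard library; the names `title_by_regex`, `title_by_parser`, `TitleParser`, and the `PAGE` string are made up for illustration and are not from the slides.

```python
import re
from html.parser import HTMLParser

PAGE = "<html><head><title>page title</title></head><body><p>hi</p></body></html>"


def title_by_regex(html):
    """Regular expression approach: short, but brittle on unusual markup."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


class TitleParser(HTMLParser):
    """Parser approach: react to tags as the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Collect text only while we are inside <title>...</title>.
        if self.in_title:
            self.title = (self.title or "") + data


def title_by_parser(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip() if parser.title else None


regex_title = title_by_regex(PAGE)
parser_title = title_by_parser(PAGE)
```

Both give the same answer here; the parser version is longer but does not depend on the exact spelling of the markup.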

Python

• high-level programming language
• scripting language
• used by Google

• {}
• list, tuple, dictionary

list
• a=['asdf',123,12.01,'abcd']
• a[3] (a[-1])
• #'abcd'
• a[0:2] (a[:2])
• #['asdf',123]
• b=['asdf',123,['qwer',12.34]]

list (Cont.)
• a=['abc',12]
• len(a)
• #2
• a.append(1)
• #['abc',12,1]
• a.insert(1,'def')
• #['abc','def',12,1]

list (Cont.)
• a= [321,456,12,1]
• a.pop()
• #[321,456,12]
• a.index(12)
• #2
• a.sort()
• #[12,321,456]
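The list operations from these slides can be collected into one small script. This is a sketch for Python 3; the variable names (`last`, `first_two`, `popped`, `where`) are invented for the example.

```python
a = ['asdf', 123, 12.01, 'abcd']
last = a[3]            # same element as a[-1]
first_two = a[0:2]     # same slice as a[:2]; the end index is excluded

b = ['abc', 12]
b.append(1)            # add one element at the end
b.insert(1, 'def')     # insert 'def' before index 1

c = [321, 456, 12, 1]
popped = c.pop()       # removes the last element and returns it
where = c.index(12)    # position of the first element equal to 12
c.sort()               # sorts the list in place and returns None
```

Note that `pop()` returns the removed element, while `sort()` changes the list itself and returns nothing.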

tuple

• a=('asdf',123,12.01) or a='asdf',123,12.01
• a=(('abc',1),123.1)
• a,b=1,2

Dictionary

• a={123:'abc','cde':456}
• a[123]
• #'abc'
• a['cde']
• #456
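A short sketch of how tuples and dictionaries from the two slides above tend to be used together, written for Python 3. The names `nested`, `table`, and `missing` are illustrative only.

```python
# Tuples: fixed-size groups of values that can be unpacked into names.
nested = (('abc', 1), 123.1)
x, y = 1, 2
x, y = y, x                      # swap two values through a temporary tuple
(name, number), ratio = nested   # unpacking follows the nesting

# Dictionaries: look values up by key; keys may be of mixed types.
table = {123: 'abc', 'cde': 456}
table['new'] = 789               # add a new key
missing = table.get('zzz', 'n/a')  # get() returns a default instead of raising KeyError
has_key = 123 in table           # membership test checks keys, not values
```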

if else
if a>10:
    print('a>10')
elif a<5:
    print('a<5')
else:
    print('5<=a<=10')

while loop
while a>2 or a<3:   # always true for integers, so this loop never ends
    pass

for loop
a=['abc',123,'def']
for x in a:
    print(x)
#abc
#123
#def

for x in range(3):
    print(x)
#0
#1
#2

for x in range(4,34,10):
    print(x)
#4
#14
#24
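The loops above print their values; the sketch below gathers them into lists instead, which makes the behaviour of `range` easier to inspect. It targets Python 3, and the helper name `collect` is made up for this example.

```python
def collect(iterable):
    """Return the values a for loop visits, in order."""
    seen = []
    for x in iterable:
        seen.append(x)
    return seen


items = collect(['abc', 123, 'def'])   # a for loop walks a list element by element
counts = collect(range(3))             # range(3) counts from 0 and stops before 3
steps = collect(range(4, 34, 10))      # start at 4, stop before 34, step by 10

# A while loop repeats until its condition becomes false.
a = 0
while a < 3:
    a += 1
```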

function
def fib(n):
    if n==0 or n==1:
        return n
    else:
        return fib(n-1)+fib(n-2)
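As a sketch, here is the recursive Fibonacci function in use, next to a loop-based variant. `fib_iter` is not from the slides; it is added only to show that the same result can be computed without recursion, which matters for larger `n` because the recursive form recomputes the same values many times.

```python
def fib(n):
    """Recursive definition: fib(0) = 0, fib(1) = 1, fib(n) = fib(n-1) + fib(n-2)."""
    if n == 0 or n == 1:
        return n
    return fib(n - 1) + fib(n - 2)


def fib_iter(n):
    """Same sequence computed with a loop and tuple assignment."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


first_ten = [fib(i) for i in range(10)]
```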

What is Beautiful Soup
not Beautiful Soap

• python module
• html/xml parser
• html/xml

Beautiful Soup
<html>
<head>
<title>
page title
</title>
</head>
<body>
<p id="firstpara" align="center">
first paragraph
<b>
one
</b>
</p>
<p id="secondpara">
second paragraph
<b>
two
</b>
</p>
</body>
</html>

check urllib/urllib2 to see
how to open a url in python
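A minimal sketch of opening a URL, assuming Python 3, where Python 2's `urllib2` became `urllib.request`. The helper name `fetch_page` and its `encoding` and `timeout` parameters are choices made for this example. A `data:` URL is used so the sketch runs without a network connection; a real call would pass an `http://` or `https://` address.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_page(url, encoding="utf-8", timeout=10):
    """Open a URL and return the response body decoded to text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(encoding)


# Offline stand-in for a web page: the markup is percent-encoded into a data: URL.
SAMPLE_URL = "data:text/html," + quote("<title>page title</title>")
page = fetch_page(SAMPLE_URL)
```

The returned string is what would be handed to the parser as `page`.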

from BeautifulSoup import BeautifulSoup
soup=BeautifulSoup(page)

soup.html.head
#<head><title>page title</title></head>

soup.head
#<head><title>page title</title></head>

soup.body.p
#<p id="firstpara" align="center">first paragraph<b>one</b></p>

(Cont.)
• parent (go to parent node)

soup.title.parent == soup.head

• next (go to next node)

soup.title.next == 'page title'
soup.title.next.next == soup.body

• previous (go to previous node)

soup.title.previous == soup.head
soup.body.p.b.previous == 'first paragraph'

(Cont.)
• contents (all content nodes)

soup.html.contents ==
[soup.html.head , soup.html.body]

• nextSibling (go to next sibling)

soup.html.body.p.nextSibling
== soup.html.body.contents[1]

• previousSibling (previous sibling)
soup.html.body.previousSibling
== soup.html.head

(Cont.)
• tag name
soup.html.body.name == 'body'

• string (text inside a tag)
soup.html.head.title.string
== str(soup.html.head.title.string)
== soup.html.head.title.contents[0]
== 'page title'

• tag attributes
soup.html.body.p.attrMap
== {'align' : 'center', 'id' : 'firstpara'}

soup.html.body.p['id'] == 'firstpara'
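The navigation attributes above are written for Beautiful Soup 3. The sketch below walks the same sample page with Beautiful Soup 4, assuming the `beautifulsoup4` package is installed (it is imported as `bs4`). In that version the camelCase names have snake_case spellings (`next_element`, `previous_element`, `next_sibling`, `previous_sibling`) and `attrs` plays the role of `attrMap`. The page is written without whitespace between tags so that no whitespace-only text nodes appear while navigating.

```python
from bs4 import BeautifulSoup

# Compact copy of the sample page; the second <p> carries only an id here.
PAGE = (
    '<html><head><title>page title</title></head>'
    '<body><p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara">second paragraph<b>two</b></p></body></html>'
)

soup = BeautifulSoup(PAGE, "html.parser")  # parser shipped with Python

parent_name = soup.title.parent.name                  # parent of <title>
next_text = soup.title.next_element                   # text inside <title>
after_text = soup.title.next_element.next_element.name  # node parsed after that text
before_b = soup.b.previous_element                    # text parsed just before <b>
child_names = [child.name for child in soup.html.contents]
second_id = soup.p.next_sibling["id"]                 # sibling of the first <p>
prev_name = soup.body.previous_sibling.name           # sibling before <body>
title_text = soup.title.string
first_attrs = dict(soup.p.attrs)                      # all attributes of the first <p>
first_id = soup.p["id"]                               # one attribute by name
```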

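• find(name, attrs, recursive, text)

name: tag name to match

attrs: attribute values the tag must have

recursive: search all descendants (True) or only direct children (False)

text: match text strings instead of tags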

find(name, attrs, recursive, text)

• soup.find('p')
#<p id="firstpara" align="center">first paragraph<b>one</b></p>

find(name, attrs, recursive, text)

soup.find('p') == soup.html.body.p

soup.find('p',id='secondpara')
#<p id="secondpara">second paragraph<b>two</b></p>

soup.find('p',recursive=False)==None

soup.find(text='one')==soup.b.next

findAll(name, attrs, recursive, text,limit)

soup.findAll('p') == [soup.html.body.p
,soup.p.nextSibling]

soup.findAll('p',id='secondpara')
#[<p id="secondpara">second paragraph<b>two</b></p>]

soup.findAll('p',recursive=False)==[]

soup.findAll(text='one')==[soup.b.next]

soup.findAll(limit=4)
==[soup.html , soup.html.head
,soup.html.head.title , soup.html.body]
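The searches above can be tried with Beautiful Soup 4 as well. This is a sketch that assumes the `beautifulsoup4` package is installed; there, `findAll` is spelled `find_all` and the `text` argument is spelled `string`. The compact `PAGE` string is a stand-in for the sample page.

```python
from bs4 import BeautifulSoup

PAGE = (
    '<html><head><title>page title</title></head>'
    '<body><p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara">second paragraph<b>two</b></p></body></html>'
)

soup = BeautifulSoup(PAGE, "html.parser")

first_p = soup.find('p')                        # first match only
second_p = soup.find('p', id='secondpara')      # filter on an attribute
top_level_p = soup.find('p', recursive=False)   # <p> is not a direct child of the document
one_text = soup.find(string='one')              # search text nodes instead of tags

all_p_ids = [p['id'] for p in soup.find_all('p')]
top_level_ps = soup.find_all('p', recursive=False)
one_texts = soup.find_all(string='one')
first_four = [tag.name for tag in soup.find_all(True, limit=4)]  # True matches every tag
```

`find` returns a single node or `None`, while `find_all` always returns a list, possibly empty.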

Other solutions
• lxml
• html5lib
• HTMLParser
• htmlfill
• Genshi
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

Reference
• Python Official Website
http://www.python.org/

• Beautiful Soup documentation
http://www.crummy.com/software/BeautifulSoup/

• personal blog
http://blog.ez2learn.com/2008/10/05/python-is-the-best-choice-to-grab-web/

• Python html parser performance
http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
