Parse the web
    using Python + Beautiful Soup




                     at ncucc
                 cwebb(dot)tw(at)gmail(dot)com
Agenda

• Python
• Beautiful Soup
Parse the web?
            but how?
Solutions

• C++
• Java
• Perl
• Python
• Others?
Solutions (Cont.)

• Regular expression
• Parser
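The two approaches can be compared with a small sketch. It uses only the standard library (`re` and `html.parser`), not Beautiful Soup; the helper names `title_by_regex`, `TitleParser` and `title_by_parser` are made up for illustration, and the sample page is an assumption.

```python
import re
from html.parser import HTMLParser


def title_by_regex(html):
    """Grab the <title> text with a pattern; breaks if the markup shape changes."""
    match = re.search(r"<title>(.*?)</title>", html, re.DOTALL)
    return match.group(1) if match else None


class TitleParser(HTMLParser):
    """Collect the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":  # the parser hands us lower-cased tag names
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def title_by_parser(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


page = "<html><head><title>page title</title></head><body></body></html>"
shouting = page.replace("title>", "TITLE>")  # same page, upper-case tag names

print(title_by_regex(page), "|", title_by_parser(page))
print(title_by_regex(shouting), "|", title_by_parser(shouting))
```

The regular expression is shorter, but it silently misses the upper-case variant, while the parser copes with it because it understands tag structure rather than exact spelling.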
So I decide...
Python + Beautiful Soup
Python

• high-level programming language
• scripting language
• used at Google
• blocks marked by indentation, not {}
• built-in list, tuple, dictionary
list
• a=['asdf',123,12.01,'abcd']
• a[3] (a[-1])
 • 'abcd'
• a[0:2] (a[:2])
 • ['asdf',123]
• b=['asdf',123,['qwer',12.34]]
list (Cont.)
• a=['abc',12]
• len(a)
• #2
• a.append(1)
• #['abc',12,1]
• a.insert(1,'def')
• #['abc','def',12,1]
list (Cont.)
• a= [321,456,12,1]
• a.pop()
• #[321,456,12]
• a.index(12)
• #2
• a.sort()
• #[12,321,456]
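The list operations from these slides can be run together as one short script. This is only a sketch; the variable names `last`, `head`, `inner`, `c`, `d`, `popped` and `where` are chosen here for illustration.

```python
a = ['asdf', 123, 12.01, 'abcd']
last = a[3]            # same element as a[-1]
head = a[0:2]          # same as a[:2]; the stop index is excluded
b = ['asdf', 123, ['qwer', 12.34]]
inner = b[2][0]        # index into the nested list

c = ['abc', 12]
c.append(1)            # add at the end
c.insert(1, 'def')     # insert before index 1

d = [321, 456, 12, 1]
popped = d.pop()       # removes and returns the last element
where = d.index(12)    # position of the first 12
d.sort()               # sorts in place and returns None

print(last, head, inner)
print(c)
print(popped, where, d)
```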
tuple

• a=('asdf',123,12.01) or a= 'asdf',123,12.01
• a=(('abc',1),123.1)
• a,b=1,2
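A small sketch of the tuple forms above, plus two things the slide only implies: unpacking can follow the nesting, and a tuple cannot be changed in place. The names `same`, `nested`, `name`, `count`, `value` and `immutable_message` are illustrative.

```python
a = ('asdf', 123, 12.01)
same = 'asdf', 123, 12.01        # parentheses are optional
nested = (('abc', 1), 123.1)
(name, count), value = nested    # unpacking follows the nesting
x, y = 1, 2
x, y = y, x                      # swap without a temporary variable

try:
    a[0] = 'changed'             # tuples cannot be modified in place
except TypeError as error:
    immutable_message = str(error)

print(a == same, name, count, value, x, y)
print(immutable_message)
```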
Dictionary

• a={123:'abc','cde':456}
• a[123]
• #'abc'
• a['cde']
• #456
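The same dictionary, with a few everyday operations added as a sketch. The extra key `'new'` and the names `first`, `second`, `missing` and `has_cde` are made up for the example.

```python
a = {123: 'abc', 'cde': 456}
first = a[123]             # look up by an integer key
second = a['cde']          # look up by a string key
a['new'] = [1, 2]          # assigning to a missing key adds it
missing = a.get('nope')    # .get returns None instead of raising KeyError
has_cde = 'cde' in a       # membership tests look at keys

for key, value in a.items():
    print(key, value)
```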
if else
if a>10:
   print('a>10')
elif a<5:
   print('a<5')
else:
   print('5<=a<=10')
while loop
while a>2 or a<3:   # always true for any ordinary number, so this never ends
 pass
for loop
a=['abc',123,'def']
for x in a:
  print(x)
# abc
# 123
# def

for x in range(3):
  print(x)
# 0
# 1
# 2

for x in range(4,34,10):
  print(x)
# 4
# 14
# 24
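To see what `range` produces without writing a loop, it can be turned into a list. The sketch also shows `enumerate`, which is not on the slide but is a common way to get the index and the item together; the names `counts`, `stepped` and `pairs` are illustrative.

```python
counts = list(range(3))            # start defaults to 0, stop is excluded
stepped = list(range(4, 34, 10))   # start, stop (excluded), step
pairs = list(enumerate(['abc', 123, 'def']))  # (index, item) tuples

print(counts)
print(stepped)
for index, item in pairs:
    print(index, item)
```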
function
def fib(n):
    if n==0 or n==1:
        return n
    else:
        return fib(n-1)+fib(n-2)
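A runnable sketch of the recursion, assuming the intent is the usual Fibonacci sequence, so the two recursive results are added together. The list `first_ten` is only there to show a call.

```python
def fib(n):
    """Return the n-th Fibonacci number (fib(0) == 0, fib(1) == 1)."""
    if n == 0 or n == 1:
        return n
    return fib(n - 1) + fib(n - 2)


first_ten = [fib(n) for n in range(10)]
print(first_ten)
```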
....
What is Beautiful Soup
                    not Beautiful Soap


• python module
• html/xml parser
• html/xml
Beautiful Soup
<html>
 <head>
  <title>
    page title
  </title>
 </head>
 <body>
  <p id="firstpara" align="center">
    first paragraph
    <b>
     one
    </b>
  </p>
  <p id="secondpara" align="blah">
    second paragraph
    <b>
     two
    </b>
  </p>
 </body>
</html>
check urllib/urllib2 to see how to open a url in python
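A minimal sketch of opening a URL, assuming Python 3, where the urllib/urllib2 functionality lives in `urllib.request`. The helper name `fetch_html` and the `data:` URL are made up for the example; the `data:` URL just keeps the sketch runnable without a network connection.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a URL and return its body decoded to text."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when no charset is declared.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Offline stand-in for a real page; swap in an http(s) URL in practice.
page = fetch_html("data:text/html," + quote("<title>page title</title>"))
print(page)
```

The returned string is what would then be handed to the parser as `page`.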

from BeautifulSoup import BeautifulSoup
soup=BeautifulSoup(page)

soup.html.head
#<head><title>page title</title></head>

soup.head
#<head><title>page title</title></head>

soup.body.p
#<p id="firstpara" align="center">first paragraph<b>one</b></p>
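The dotted access above can be tried with the current `bs4` package. That is an assumption: the slides import the older `BeautifulSoup` module, and `bs4` has to be installed separately. The sample page is written without whitespace between tags so that no whitespace-only text nodes show up in the tree.

```python
from bs4 import BeautifulSoup  # assumption: pip install beautifulsoup4

page = (
    '<html><head><title>page title</title></head>'
    '<body><p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(page, "html.parser")  # parser bundled with Python

head_markup = str(soup.html.head)        # dotted access walks down by tag name
same_head = soup.head == soup.html.head  # skipping levels finds the first match
first_p_id = soup.body.p["id"]           # .p is the first <p> inside <body>
first_p_text = soup.body.p.get_text()    # all text under the tag, joined

print(head_markup)
print(same_head, first_p_id, first_p_text)
```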
(Cont.)
• parent         (go to parent node)

    soup.title.parent == soup.head

• next             (go to next node)

    soup.title.next == 'page title'
    soup.title.next.next == soup.body

• previous     (go to previous node)

    soup.title.previous == soup.head
    soup.body.p.b.previous == 'first paragraph'
(Cont.)
• contents         (all content nodes)

     soup.html.contents ==
     [soup.html.head , soup.html.body]

• nextSibling      (go to next sibling)

     soup.html.body.p.nextSibling
     == soup.html.body.contents[1]

• previousSibling (previous sibling)
     soup.html.body.previousSibling
     == soup.html.head
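The navigation attributes can be checked with a short script. This sketch assumes the current `bs4` package, which spells `next`, `previous`, `nextSibling` and `previousSibling` as `next_element`, `previous_element`, `next_sibling` and `previous_sibling`. The page has no whitespace between tags; with whitespace, extra text nodes would sit between the elements. The `checks` dictionary is just a convenient way to print each result.

```python
from bs4 import BeautifulSoup  # assumption: pip install beautifulsoup4

page = (
    '<html><head><title>page title</title></head>'
    '<body><p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(page, "html.parser")

checks = {
    "title.parent is head": soup.title.parent == soup.head,
    "title.next_element is its text": soup.title.next_element == "page title",
    "the element after that is body":
        soup.title.next_element.next_element == soup.body,
    "title.previous_element is head": soup.title.previous_element == soup.head,
    "b.previous_element is the paragraph text":
        soup.body.p.b.previous_element == "first paragraph",
    "html.contents is [head, body]":
        soup.html.contents == [soup.head, soup.body],
    "first p's next_sibling is the second p":
        soup.body.p.next_sibling == soup.body.contents[1],
    "body.previous_sibling is head": soup.body.previous_sibling == soup.head,
}
for label, ok in checks.items():
    print(label, ok)
```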
(Cont.)
• tag name
    soup.html.body.name == 'body'

• string
    soup.html.head.title.string
    == str(soup.html.head.title.string)
    == soup.html.head.title.contents[0]
    == 'page title'

• tag attributes
    soup.html.body.p.attrMap
    == {'align' : 'center', 'id' : 'firstpara'}

    soup.html.body.p['id'] == 'firstpara'
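A sketch of the same lookups with the current `bs4` package (an assumption; it exposes the attribute dictionary as `.attrs` rather than `attrMap`). It also shows that `str()` of a tag gives the whole markup, while `.string` gives only the text inside it. The variable names are illustrative.

```python
from bs4 import BeautifulSoup  # assumption: pip install beautifulsoup4

page = (
    '<html><head><title>page title</title></head>'
    '<body><p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(page, "html.parser")

tag_name = soup.body.name                 # name of the tag itself
title_text = soup.title.string            # the single text node inside <title>
title_markup = str(soup.title)            # the whole tag, markup included
first_child = soup.title.contents[0]      # same node as .string here
attributes = dict(soup.body.p.attrs)      # all attributes as a plain dict
one_attribute = soup.body.p["id"]         # dictionary-style access
maybe_class = soup.body.p.get("class")    # None when the attribute is absent

print(tag_name, title_text, title_markup)
print(attributes, one_attribute, maybe_class)
```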
find(name, attrs, recursive, text)



• soup.find('p')
   #<p id="firstpara" align="center">first paragraph<b>one</b></p>
find(name, attrs, recursive, text)


soup.find('p') == soup.html.body.p

soup.find('p',id='secondpara')
  #<p id="secondpara" align="blah">second paragraph<b>two</b></p>

soup.find('p',recursive=False)==None

soup.find(text='one')==soup.b.next
findAll(name, attrs, recursive, text,limit)

soup.findAll('p') == [soup.html.body.p
                     ,soup.p.nextSibling]

soup.findAll('p',id='secondpara')
  #[<p id="secondpara" align="blah">second paragraph<b>two</b></p>]

soup.findAll('p',recursive=False)==[]

soup.findAll(text='one')==[soup.b.next]

soup.findAll(limit=4)
==[soup.html , soup.html.head
   ,soup.html.head.title , soup.html.body]
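The searches can be sketched with the current `bs4` package, which is an assumption: there the methods are spelled `find` and `find_all`, and the `text` argument is named `string`. Passing `True` as the name matches every tag, which is used here for the `limit` example. The variable names are illustrative.

```python
from bs4 import BeautifulSoup  # assumption: pip install beautifulsoup4

page = (
    '<html><head><title>page title</title></head>'
    '<body><p id="firstpara" align="center">first paragraph<b>one</b></p>'
    '<p id="secondpara" align="blah">second paragraph<b>two</b></p>'
    '</body></html>'
)
soup = BeautifulSoup(page, "html.parser")

first_p = soup.find("p")                        # first match in document order
second_p = soup.find("p", id="secondpara")      # filter on an attribute
top_level_p = soup.find("p", recursive=False)   # direct children of soup only
one_text = soup.find(string="one")              # search text nodes, not tags

all_p_ids = [p["id"] for p in soup.find_all("p")]
by_attrs = soup.find_all("p", attrs={"align": "blah"})
no_direct_p = soup.find_all("p", recursive=False)
first_four = [tag.name for tag in soup.find_all(True, limit=4)]

print(first_p["id"], second_p["align"], top_level_p, one_text)
print(all_p_ids, len(by_attrs), no_direct_p, first_four)
```

`find` returns a single result or `None`, while `find_all` always returns a list, so an unsuccessful search gives `None` in the first case and an empty list in the second.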
Other solutions
• lxml
• html5lib
• HTMLParser
• htmlfill
• Genshi
  http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
Reference
• Python Official Website
  http://www.python.org/


• Beautiful Soup documentation
  http://www.crummy.com/software/BeautifulSoup/


• personal blog
  http://blog.ez2learn.com/2008/10/05/python-is-the-best-choice-to-grab-web/


• Python html parser performance
  http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/
