SlideShare a Scribd company logo
1 of 13
Download to read offline
Web Scrapping with Python




               Miguel Miranda de Mattos
                      :@mmmattos - mmmattos.net
                            Porto Alegre, Brazil.
                                          2012
Web Scrapping with Python

     ● Tools:

       ○ BeautifulSoup

       ○ Mechanize
BeautifulSoup
 An HTML/XML parser for Python that can turn even invalid
markup into a parse tree. It provides simple, idiomatic ways
of navigating, searching, and modifying the parse tree. It
commonly saves programmers hours or days of work.

● In Summary:
   ○ Navigate the "soup" of HTML/XML tags,
     programatically

   ○ Access tag´s properties and values

   ○ Search for tags and their attributes.
BeautifulSoup
  ○ Example:

      from BeautifulSoup import BeautifulSoup
      doc = "<html><h1>Heading</h1><p>Text"
      soup = BeautifulSoup(doc)
      print soup.prettify()

      # <html>
      # <h1>
      # Heading
      # </h1>
      # <p>
      # Text
      # </p>
      # </html>
  ○
BeautifulSoup

  ○ Searching / Looking for things
    ■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild',
           'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings',
           'findParent', 'findParents', 'findPrevious', 'findPreviousSibling',
           'findPreviousSiblings'

      ■ findAll
        ● findAll(self, name=None, attrs={}, recursive=True,
                text=None, limit=None, **kwargs)

           ●       Extracts a list of Tag objects that match the given
           ●       criteria. You can specify the name of the Tag and any
           ●       attributes you want the Tag to have.
BeautifulSoup
● Example:
  >>> from BeautifulSoup import BeautifulSoup
  >>> doc = "<table><tr><td>one</td><td>two</td></tr></table>"
  >>> docSoup = BeautifulSoup(doc)

  >>> print docSoup.findAll('tr')
  [<tr><td>one</td><td>two</td></tr>]

  >>> print docSoup.findAll('td')
  [<td>one</td>, <td>two</td>]
BeautifulSoup
● findAll (cont´d.):
   >>> for t in docSoup.findAll('td'):
   >>>     print t

   <td>one</td>
   <td>two</td>

   >>> for t in docSoup.findAll('td'):
   >>>     print t.getText()

   one
   two
BeautifulSoup
●   findAll using attributes to qualify:
    >>> soup.findAll('div',attrs = {'class': 'Menus'})
    [<div>musicMenu</div>,<div>videoMenu</div>]

●   For more options:
    ○   dir (BeautifulSoup)
    ○   help (yourSoup.<command>)

●   Use BeautifulSoup rather than regexp patterns:
        patFinderTitle = re.compile(r'<a[^>]*stitle="(.*?)"')
        re.findAll(patFinderTitle, html)
    ○   by
        soup = BeautifulSoup(html)
        for tag in brand_row_soup.findAll('a'):
        print tag['title']
Mechanize
● Stateful programmatic web browsing in Python, after
   Andy Lester’s Perl module.
    ●   mechanize.Browser and mechanize.UserAgentBase, so:
          ○ any URL can be opened, not just http:
          ○ mechanize.UserAgentBase offers easy dynamic configuration of
            user-agent features like protocol, cookie, redirection and robots.
            txt handling, without having to make a new OpenerDirector each
            time, e.g. by callingbuild_opener().
    ●   Easy HTML form filling.
    ●   Convenient link parsing and following.
    ●   Browser history (.back() and .reload() methods).
    ●   The Referer HTTP header is added properly (optional).
    ●   Automatic observance of robots.txt.
    ●   Automatic handling of HTTP-Equiv and Refresh.
Mechanize
● Navigation commands:
  ○ open(url)
  ○ follow_link(link)
  ○ back()
  ○ submit()
  ○ reload()
● Examples
   br = mechanize.Browser()
   br.open("python.org")
   gothtml = br.response().read()
   for link in br.links(url_regex="python.org"):
      print link
      br.follow_link(link) # takes EITHER Link instance OR keyword args
      br.back()
Mechanize
● Example:
     import re
     import mechanize

     br = mechanize.Browser()
     br.open("http://www.example.com/")

     # follow second link with element text matching
     # regular expression
     response1 = br.follow_link(text_regex=r"cheeses*shop")

     assert br.viewing_html()
     print br.title()
     print response1.geturl()
     print response1.info() # headers
     print response1.read() # body
Mechanize
● Example: Combining Mechanize and BeautifulSoup
     import re
     import mechanize
     from BeautifulSoup import BeutifulSoup

     url = "http://www.hp.com"
     br = mechanize.Browser()
     br..open(url)
     assert br.viewing_html()
     html = br.response().read()
     result_soup = BeautifulSoup(html)

     found_divs = soup.findAll('div')
     print "Found " + str(len(found_divs))
     for d in found_divs:
           print d
Mechanize
● Example: Combining Mechanize and BeautifulSoup
     import re
     import mechanize

     url = "http://www.hp.com"
     br = mechanize.Browser()
     br..open(url)
     assert br.viewing_html()
     html = br.response().read()
     result_soup = BeautifulSoup(html)

     found_divs = soup.findAll('div')
     print "Found " + str(len(found_divs))
     for d in found_divs:
           if d.has_key('class'):
                 print d['class']

More Related Content

What's hot

How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.Diep Nguyen
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Sammy Fung
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0IBM Cloud Data Services
 
Do something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseDo something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseBruce McPherson
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRick Copeland
 
Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfErin Shellman
 
Do something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appDo something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appBruce McPherson
 
regular expressions and the world wide web
regular expressions and the world wide webregular expressions and the world wide web
regular expressions and the world wide webSergio Burdisso
 
Analyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaAnalyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaVincent Terrasi
 
CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourPeter Friese
 

What's hot (20)

Fun with Python
Fun with PythonFun with Python
Fun with Python
 
Pydata-Python tools for webscraping
Pydata-Python tools for webscrapingPydata-Python tools for webscraping
Pydata-Python tools for webscraping
 
How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.How to scraping content from web for location-based mobile app.
How to scraping content from web for location-based mobile app.
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
 
Selenium&amp;scrapy
Selenium&amp;scrapySelenium&amp;scrapy
Selenium&amp;scrapy
 
Scrapy-101
Scrapy-101Scrapy-101
Scrapy-101
 
Scrapy.for.dummies
Scrapy.for.dummiesScrapy.for.dummies
Scrapy.for.dummies
 
CouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce ViewsCouchDB Day NYC 2017: MapReduce Views
CouchDB Day NYC 2017: MapReduce Views
 
CouchDB Day NYC 2017: Full Text Search
CouchDB Day NYC 2017: Full Text SearchCouchDB Day NYC 2017: Full Text Search
CouchDB Day NYC 2017: Full Text Search
 
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0CouchDB Day NYC 2017: Introduction to CouchDB 2.0
CouchDB Day NYC 2017: Introduction to CouchDB 2.0
 
CouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: MangoCouchDB Day NYC 2017: Mango
CouchDB Day NYC 2017: Mango
 
Do something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as databaseDo something in 5 minutes with gas 1-use spreadsheet as database
Do something in 5 minutes with gas 1-use spreadsheet as database
 
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQRealtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ
 
Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourself
 
Do something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing appDo something in 5 with gas 3-simple invoicing app
Do something in 5 with gas 3-simple invoicing app
 
Routing @ Scuk.cz
Routing @ Scuk.czRouting @ Scuk.cz
Routing @ Scuk.cz
 
regular expressions and the world wide web
regular expressions and the world wide webregular expressions and the world wide web
regular expressions and the world wide web
 
Analyse your SEO Data with R and Kibana
Analyse your SEO Data with R and KibanaAnalyse your SEO Data with R and Kibana
Analyse your SEO Data with R and Kibana
 
Introducing CouchDB
Introducing CouchDBIntroducing CouchDB
Introducing CouchDB
 
CouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 HourCouchDB Mobile - From Couch to 5K in 1 Hour
CouchDB Mobile - From Couch to 5K in 1 Hour
 

Similar to Web Scrapping with Python

The Ring programming language version 1.2 book - Part 33 of 84
The Ring programming language version 1.2 book - Part 33 of 84The Ring programming language version 1.2 book - Part 33 of 84
The Ring programming language version 1.2 book - Part 33 of 84Mahmoud Samir Fayed
 
The Ring programming language version 1.6 book - Part 47 of 189
The Ring programming language version 1.6 book - Part 47 of 189The Ring programming language version 1.6 book - Part 47 of 189
The Ring programming language version 1.6 book - Part 47 of 189Mahmoud Samir Fayed
 
PyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slidesPyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slidesArtur Barseghyan
 
JavaScriptL18 [Autosaved].pptx
JavaScriptL18 [Autosaved].pptxJavaScriptL18 [Autosaved].pptx
JavaScriptL18 [Autosaved].pptxVivekBaghel30
 
The Ring programming language version 1.8 book - Part 50 of 202
The Ring programming language version 1.8 book - Part 50 of 202The Ring programming language version 1.8 book - Part 50 of 202
The Ring programming language version 1.8 book - Part 50 of 202Mahmoud Samir Fayed
 
Django tech-talk
Django tech-talkDjango tech-talk
Django tech-talkdtdannen
 
Introduction to Django
Introduction to DjangoIntroduction to Django
Introduction to DjangoJames Casey
 
Web performance essentials - Goodies
Web performance essentials - GoodiesWeb performance essentials - Goodies
Web performance essentials - GoodiesJerry Emmanuel
 
Rapid and Scalable Development with MongoDB, PyMongo, and Ming
Rapid and Scalable Development with MongoDB, PyMongo, and MingRapid and Scalable Development with MongoDB, PyMongo, and Ming
Rapid and Scalable Development with MongoDB, PyMongo, and MingRick Copeland
 
FYBSC IT Web Programming Unit III Core Javascript
FYBSC IT Web Programming Unit III  Core JavascriptFYBSC IT Web Programming Unit III  Core Javascript
FYBSC IT Web Programming Unit III Core JavascriptArti Parab Academics
 
Introduction to html5 and css3
Introduction to html5 and css3Introduction to html5 and css3
Introduction to html5 and css3Sunny Batabyal
 
Django - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazosDjango - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazosIgor Sobreira
 
Jquery presentation
Jquery presentationJquery presentation
Jquery presentationguest5d87aa6
 
Introduction to HTML-CSS-Javascript.pdf
Introduction to HTML-CSS-Javascript.pdfIntroduction to HTML-CSS-Javascript.pdf
Introduction to HTML-CSS-Javascript.pdfDakshPratapSingh1
 
Neoito — How modern browsers work
Neoito — How modern browsers workNeoito — How modern browsers work
Neoito — How modern browsers workNeoito
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2fishwarter
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2fishwarter
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2fishwarter
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2fishwarter
 

Similar to Web Scrapping with Python (20)

The Ring programming language version 1.2 book - Part 33 of 84
The Ring programming language version 1.2 book - Part 33 of 84The Ring programming language version 1.2 book - Part 33 of 84
The Ring programming language version 1.2 book - Part 33 of 84
 
The Ring programming language version 1.6 book - Part 47 of 189
The Ring programming language version 1.6 book - Part 47 of 189The Ring programming language version 1.6 book - Part 47 of 189
The Ring programming language version 1.6 book - Part 47 of 189
 
PyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slidesPyGrunn 2017 - Django Performance Unchained - slides
PyGrunn 2017 - Django Performance Unchained - slides
 
JavaScriptL18 [Autosaved].pptx
JavaScriptL18 [Autosaved].pptxJavaScriptL18 [Autosaved].pptx
JavaScriptL18 [Autosaved].pptx
 
The Ring programming language version 1.8 book - Part 50 of 202
The Ring programming language version 1.8 book - Part 50 of 202The Ring programming language version 1.8 book - Part 50 of 202
The Ring programming language version 1.8 book - Part 50 of 202
 
Django tech-talk
Django tech-talkDjango tech-talk
Django tech-talk
 
Introduction to Django
Introduction to DjangoIntroduction to Django
Introduction to Django
 
Web performance essentials - Goodies
Web performance essentials - GoodiesWeb performance essentials - Goodies
Web performance essentials - Goodies
 
Rapid and Scalable Development with MongoDB, PyMongo, and Ming
Rapid and Scalable Development with MongoDB, PyMongo, and MingRapid and Scalable Development with MongoDB, PyMongo, and Ming
Rapid and Scalable Development with MongoDB, PyMongo, and Ming
 
FYBSC IT Web Programming Unit III Core Javascript
FYBSC IT Web Programming Unit III  Core JavascriptFYBSC IT Web Programming Unit III  Core Javascript
FYBSC IT Web Programming Unit III Core Javascript
 
Introduction to html5 and css3
Introduction to html5 and css3Introduction to html5 and css3
Introduction to html5 and css3
 
Django - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazosDjango - Framework web para perfeccionistas com prazos
Django - Framework web para perfeccionistas com prazos
 
Code Management
Code ManagementCode Management
Code Management
 
Jquery presentation
Jquery presentationJquery presentation
Jquery presentation
 
Introduction to HTML-CSS-Javascript.pdf
Introduction to HTML-CSS-Javascript.pdfIntroduction to HTML-CSS-Javascript.pdf
Introduction to HTML-CSS-Javascript.pdf
 
Neoito — How modern browsers work
Neoito — How modern browsers workNeoito — How modern browsers work
Neoito — How modern browsers work
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2
 
The Django Web Application Framework 2
The Django Web Application Framework 2The Django Web Application Framework 2
The Django Web Application Framework 2
 

Recently uploaded

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 

Recently uploaded (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 

Web Scrapping with Python

  • 1. Web Scrapping with Python Miguel Miranda de Mattos :@mmmattos - mmmattos.net Porto Alegre, Brazil. 2012
  • 2. Web Scrapping with Python ● Tools: ○ BeautifulSoup ○ Mechanize
  • 3. BeautifulSoup An HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. ● In Summary: ○ Navigate the "soup" of HTML/XML tags, programatically ○ Access tag´s properties and values ○ Search for tags and their attributes.
  • 4. BeautifulSoup ○ Example: from BeautifulSoup import BeautifulSoup doc = "<html><h1>Heading</h1><p>Text" soup = BeautifulSoup(doc) print soup.prettify() # <html> # <h1> # Heading # </h1> # <p> # Text # </p> # </html> ○
  • 5. BeautifulSoup ○ Searching / Looking for things ■ 'find', 'findAll', 'findAllNext', 'findAllPrevious', 'findChild', 'findChildren', 'findNext', 'findNextSibling', 'findNextSiblings', 'findParent', 'findParents', 'findPrevious', 'findPreviousSibling', 'findPreviousSiblings' ■ findAll ● findAll(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) ● Extracts a list of Tag objects that match the given ● criteria. You can specify the name of the Tag and any ● attributes you want the Tag to have.
  • 6. BeautifulSoup ● Example: >>> from BeautifulSoup import BeautifulSoup >>> doc = "<table><tr><td>one</td><td>two</td></tr></table>" >>> docSoup = BeautifulSoup(doc) >>> print docSoup.findAll('tr') [<tr><td>one</td><td>two</td></tr>] >>> print docSoup.findAll('td') [<td>one</td>, <td>two</td>]
  • 7. BeautifulSoup ● findAll (cont´d.): >>> for t in docSoup.findAll('td'): >>> print t <td>one</td> <td>two</td> >>> for t in docSoup.findAll('td'): >>> print t.getText() one two
  • 8. BeautifulSoup ● findAll using attributes to qualify: >>> soup.findAll('div',attrs = {'class': 'Menus'}) [<div>musicMenu</div>,<div>videoMenu</div>] ● For more options: ○ dir (BeautifulSoup) ○ help (yourSoup.<command>) ● Use BeautifulSoup rather than regexp patterns: patFinderTitle = re.compile(r'<a[^>]*stitle="(.*?)"') re.findAll(patFinderTitle, html) ○ by soup = BeautifulSoup(html) for tag in brand_row_soup.findAll('a'): print tag['title']
  • 9. Mechanize ● Stateful programmatic web browsing in Python, after Andy Lester’s Perl module. ● mechanize.Browser and mechanize.UserAgentBase, so: ○ any URL can be opened, not just http: ○ mechanize.UserAgentBase offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots. txt handling, without having to make a new OpenerDirector each time, e.g. by callingbuild_opener(). ● Easy HTML form filling. ● Convenient link parsing and following. ● Browser history (.back() and .reload() methods). ● The Referer HTTP header is added properly (optional). ● Automatic observance of robots.txt. ● Automatic handling of HTTP-Equiv and Refresh.
  • 10. Mechanize ● Navigation commands: ○ open(url) ○ follow_link(link) ○ back() ○ submit() ○ reload() ● Examples br = mechanize.Browser() br.open("python.org") gothtml = br.response().read() for link in br.links(url_regex="python.org"): print link br.follow_link(link) # takes EITHER Link instance OR keyword args br.back()
  • 11. Mechanize ● Example: import re import mechanize br = mechanize.Browser() br.open("http://www.example.com/") # follow second link with element text matching # regular expression response1 = br.follow_link(text_regex=r"cheeses*shop") assert br.viewing_html() print br.title() print response1.geturl() print response1.info() # headers print response1.read() # body
  • 12. Mechanize ● Example: Combining Mechanize and BeautifulSoup import re import mechanize from BeautifulSoup import BeutifulSoup url = "http://www.hp.com" br = mechanize.Browser() br..open(url) assert br.viewing_html() html = br.response().read() result_soup = BeautifulSoup(html) found_divs = soup.findAll('div') print "Found " + str(len(found_divs)) for d in found_divs: print d
  • 13. Mechanize ● Example: Combining Mechanize and BeautifulSoup import re import mechanize url = "http://www.hp.com" br = mechanize.Browser() br..open(url) assert br.viewing_html() html = br.response().read() result_soup = BeautifulSoup(html) found_divs = soup.findAll('div') print "Found " + str(len(found_divs)) for d in found_divs: if d.has_key('class'): print d['class']