SlideShare uma empresa Scribd logo
1 de 34
Baixar para ler offline
Web Scraping
                               @alvaro_aguirre




Saturday, November 5, 2011
In search of our cosmic origins...




Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Data Scraping
                                  vs
                             Web Scraping


Saturday, November 5, 2011
Data Scraping
                    <html>
                             <header></header>
                             <body>
                             .....
                             </body>
                    </html>




Saturday, November 5, 2011
Web Scraping




Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Deliverance
                               XDV
                             Diazo



Saturday, November 5, 2011
Diazo




Saturday, November 5, 2011
Saturday, November 5, 2011
<replace css:content=”h1” css:theme=”#main” />




Saturday, November 5, 2011
<drop css:content=”h1” />

                             <drop css:theme=”breadcrumbs” />




Saturday, November 5, 2011
<replace css:theme=”#header” content=”#header-
                       element” if-content=”” />




Saturday, November 5, 2011
<drop css:theme="#info-box" if-path="/news"/>




Saturday, November 5, 2011
<theme/>
                         <notheme/>
                         <replace/>
                         <before/>
                         <after/>
                         <drop/>
                         <strip/>
                         <merge/>
                         <copy/>




Saturday, November 5, 2011
<replace css:theme="#details">
          <dl id="details">
            <xsl:for-each css:select="table#details > tr">
              <dt><xsl:copy-of select="td[1]/text()" /></dt>
              <dd><xsl:copy-of select="td[2]/node()"/></dd>
            </xsl:for-each>
          </dl>
       </replace>/></dt>




      <table id="details">
                                              <dl id="details">
          <tr>
                                                  <dt>One</dt>
               <td>One</td>
                                                  <dd>1</dd>
               <td>1</td>
                                                  <dt>Two</dt>
          </tr>
                                                  <dd>2</dd>
          <tr>
                                              </dl>
               <td>Two</td>
               <td>2</td>
          </tr>
      </table>




Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Tools



Saturday, November 5, 2011
External Content




Saturday, November 5, 2011
Saturday, November 5, 2011
• development of web & mobile interfaces
                    • legacy apps integrations
                    • prototypes
                    • low coupling


Saturday, November 5, 2011
from diazo.compiler import compile_theme
                             from lxml import etree
                             from diazo.compiler import compile_theme

                             absolute_prefix = "/static"

                             rules = "rules.xml"
                             theme = "theme.html"

                             compiled_theme = compile_theme(rules, theme,
                                                            absolute_prefix=absolute_prefix)

                             transform = etree.XSLT(compiled_theme)
                             content = etree.parse(some_content)
                             transformed = transform(content)

                             output = etree.tostring(transformed)




Saturday, November 5, 2011
github/aaguirre




Saturday, November 5, 2011
diazo.org




Saturday, November 5, 2011
plone.org




Saturday, November 5, 2011
gracias!




Saturday, November 5, 2011

Mais conteúdo relacionado

Semelhante a Web Scraping using Diazo!

What's new in HTML5, CSS3 and JavaScript, James Pearce
What's new in HTML5, CSS3 and JavaScript, James PearceWhat's new in HTML5, CSS3 and JavaScript, James Pearce
What's new in HTML5, CSS3 and JavaScript, James PearceSencha
 
Moving to Dojo 1.7 and the path to 2.0
Moving to Dojo 1.7 and the path to 2.0Moving to Dojo 1.7 and the path to 2.0
Moving to Dojo 1.7 and the path to 2.0James Thomas
 
international PHP2011_ilia alshanetsky_Hidden Features of PHP
international PHP2011_ilia alshanetsky_Hidden Features of PHPinternational PHP2011_ilia alshanetsky_Hidden Features of PHP
international PHP2011_ilia alshanetsky_Hidden Features of PHPsmueller_sandsmedia
 
JS Tooling in Rails 3.1
JS Tooling in Rails 3.1JS Tooling in Rails 3.1
JS Tooling in Rails 3.1Duda Dornelles
 
Javascript - How to avoid the bad parts
Javascript - How to avoid the bad partsJavascript - How to avoid the bad parts
Javascript - How to avoid the bad partsMikko Ohtamaa
 
잘 알려지지 않은 Php 코드 활용하기
잘 알려지지 않은 Php 코드 활용하기잘 알려지지 않은 Php 코드 활용하기
잘 알려지지 않은 Php 코드 활용하기형우 안
 
永不止步的“重构”
永不止步的“重构”永不止步的“重构”
永不止步的“重构”Kejun Zhang
 
A Tour Through the Groovy Ecosystem
A Tour Through the Groovy EcosystemA Tour Through the Groovy Ecosystem
A Tour Through the Groovy EcosystemLeonard Axelsson
 
让开发也懂前端
让开发也懂前端让开发也懂前端
让开发也懂前端lifesinger
 
Parser combinators
Parser combinatorsParser combinators
Parser combinatorslifecoder
 
Clouds against the Floods (RubyConfBR2011)
Clouds against the Floods (RubyConfBR2011) Clouds against the Floods (RubyConfBR2011)
Clouds against the Floods (RubyConfBR2011) Leonardo Borges
 
Ext GWT 3.0 Advanced Templates
Ext GWT 3.0 Advanced TemplatesExt GWT 3.0 Advanced Templates
Ext GWT 3.0 Advanced TemplatesSencha
 
HTML5 & CSS3 in Drupal (on the Bayou)
HTML5 & CSS3 in Drupal (on the Bayou)HTML5 & CSS3 in Drupal (on the Bayou)
HTML5 & CSS3 in Drupal (on the Bayou)Mediacurrent
 
Rails ORM De-mystifying Active Record has_many
Rails ORM De-mystifying Active Record has_manyRails ORM De-mystifying Active Record has_many
Rails ORM De-mystifying Active Record has_manyBlazing Cloud
 
The Solar Framework for PHP
The Solar Framework for PHPThe Solar Framework for PHP
The Solar Framework for PHPConFoo
 
The Easy Way - Plone Conference 2011
The Easy Way - Plone Conference 2011The Easy Way - Plone Conference 2011
The Easy Way - Plone Conference 2011Mikko Ohtamaa
 

Semelhante a Web Scraping using Diazo! (20)

Semantic chop
Semantic chopSemantic chop
Semantic chop
 
What's new in HTML5, CSS3 and JavaScript, James Pearce
What's new in HTML5, CSS3 and JavaScript, James PearceWhat's new in HTML5, CSS3 and JavaScript, James Pearce
What's new in HTML5, CSS3 and JavaScript, James Pearce
 
Moving to Dojo 1.7 and the path to 2.0
Moving to Dojo 1.7 and the path to 2.0Moving to Dojo 1.7 and the path to 2.0
Moving to Dojo 1.7 and the path to 2.0
 
international PHP2011_ilia alshanetsky_Hidden Features of PHP
international PHP2011_ilia alshanetsky_Hidden Features of PHPinternational PHP2011_ilia alshanetsky_Hidden Features of PHP
international PHP2011_ilia alshanetsky_Hidden Features of PHP
 
JS Tooling in Rails 3.1
JS Tooling in Rails 3.1JS Tooling in Rails 3.1
JS Tooling in Rails 3.1
 
Javascript - How to avoid the bad parts
Javascript - How to avoid the bad partsJavascript - How to avoid the bad parts
Javascript - How to avoid the bad parts
 
잘 알려지지 않은 Php 코드 활용하기
잘 알려지지 않은 Php 코드 활용하기잘 알려지지 않은 Php 코드 활용하기
잘 알려지지 않은 Php 코드 활용하기
 
Sequel @ madrid-rb
Sequel @  madrid-rbSequel @  madrid-rb
Sequel @ madrid-rb
 
永不止步的“重构”
永不止步的“重构”永不止步的“重构”
永不止步的“重构”
 
A Tour Through the Groovy Ecosystem
A Tour Through the Groovy EcosystemA Tour Through the Groovy Ecosystem
A Tour Through the Groovy Ecosystem
 
让开发也懂前端
让开发也懂前端让开发也懂前端
让开发也懂前端
 
Parser combinators
Parser combinatorsParser combinators
Parser combinators
 
Xml
XmlXml
Xml
 
Clouds against the Floods (RubyConfBR2011)
Clouds against the Floods (RubyConfBR2011) Clouds against the Floods (RubyConfBR2011)
Clouds against the Floods (RubyConfBR2011)
 
Ext GWT 3.0 Advanced Templates
Ext GWT 3.0 Advanced TemplatesExt GWT 3.0 Advanced Templates
Ext GWT 3.0 Advanced Templates
 
HTML5 & CSS3 in Drupal (on the Bayou)
HTML5 & CSS3 in Drupal (on the Bayou)HTML5 & CSS3 in Drupal (on the Bayou)
HTML5 & CSS3 in Drupal (on the Bayou)
 
Rails ORM De-mystifying Active Record has_many
Rails ORM De-mystifying Active Record has_manyRails ORM De-mystifying Active Record has_many
Rails ORM De-mystifying Active Record has_many
 
The Solar Framework for PHP
The Solar Framework for PHPThe Solar Framework for PHP
The Solar Framework for PHP
 
The Easy Way - Plone Conference 2011
The Easy Way - Plone Conference 2011The Easy Way - Plone Conference 2011
The Easy Way - Plone Conference 2011
 
Xml presentation
Xml presentationXml presentation
Xml presentation
 

Último

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 

Último (20)

Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Web Scraping using Diazo!