SlideShare uma empresa Scribd logo
1 de 34
Baixar para ler offline
Web Scraping
                               @alvaro_aguirre




Saturday, November 5, 2011
In search of our cosmic origins...




Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Data Scraping
                                  vs
                             Web Scraping


Saturday, November 5, 2011
Data Scraping
                    <html>
                             <header></header>
                             <body>
                             .....
                             </body>
                    </html>




Saturday, November 5, 2011
Web Scraping




Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Deliverance
                               XDV
                             Diazo



Saturday, November 5, 2011
Diazo




Saturday, November 5, 2011
Saturday, November 5, 2011
<replace css:content=”h1” css:theme=”#main” />




Saturday, November 5, 2011
<drop css:content=”h1” />

                             <drop css:theme=”breadcrumbs” />




Saturday, November 5, 2011
<replace css:theme=”#header” content=”#header-
                       element” if-content=”” />




Saturday, November 5, 2011
<drop css:theme="#info-box" if-path="/news"/>




Saturday, November 5, 2011
<theme/>
                         <notheme/>
                         <replace/>
                         <before/>
                         <after/>
                         <drop/>
                         <strip/>
                         <merge/>
                         <copy/>




Saturday, November 5, 2011
<replace css:theme="#details">
          <dl id="details">
            <xsl:for-each css:select="table#details > tr">
              <dt><xsl:copy-of select="td[1]/text()" /></dt>
              <dd><xsl:copy-of select="td[2]/node()"/></dd>
            </xsl:for-each>
          </dl>
       </replace>/></dt>




      <table id="details">
                                              <dl id="details">
          <tr>
                                                  <dt>One</dt>
               <td>One</td>
                                                  <dd>1</dd>
               <td>1</td>
                                                  <dt>Two</dt>
          </tr>
                                                  <dd>2</dd>
          <tr>
                                              </dl>
               <td>Two</td>
               <td>2</td>
          </tr>
      </table>




Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Saturday, November 5, 2011
Tools



Saturday, November 5, 2011
External Content




Saturday, November 5, 2011
Saturday, November 5, 2011
• development of web & mobile interfaces
                    • legacy apps integrations
                    • prototypes
                    • low coupling


Saturday, November 5, 2011
from diazo.compiler import compile_theme
                             from lxml import etree
                             from diazo.compiler import compile_theme

                             absolute_prefix = "/static"

                             rules = "rules.xml"
                             theme = "theme.html"

                             compiled_theme = compile_theme(rules, theme,
                                                            absolute_prefix=absolute_prefix)

                             transform = etree.XSLT(compiled_theme)
                             content = etree.parse(some_content)
                             transformed = transform(content)

                             output = etree.tostring(transformed)




Saturday, November 5, 2011
github/aaguirre




Saturday, November 5, 2011
diazo.org




Saturday, November 5, 2011
plone.org




Saturday, November 5, 2011
gracias!




Saturday, November 5, 2011

Mais conteúdo relacionado

Semelhante a Web Scraping using Diazo!

What's new in HTML5, CSS3 and JavaScript, James Pearce
What's new in HTML5, CSS3 and JavaScript, James PearceWhat's new in HTML5, CSS3 and JavaScript, James Pearce
What's new in HTML5, CSS3 and JavaScript, James PearceSencha
 
Moving to Dojo 1.7 and the path to 2.0
Moving to Dojo 1.7 and the path to 2.0Moving to Dojo 1.7 and the path to 2.0
Moving to Dojo 1.7 and the path to 2.0James Thomas
 
international PHP2011_ilia alshanetsky_Hidden Features of PHP
international PHP2011_ilia alshanetsky_Hidden Features of PHPinternational PHP2011_ilia alshanetsky_Hidden Features of PHP
international PHP2011_ilia alshanetsky_Hidden Features of PHPsmueller_sandsmedia
 
JS Tooling in Rails 3.1
JS Tooling in Rails 3.1JS Tooling in Rails 3.1
JS Tooling in Rails 3.1Duda Dornelles
 
Javascript - How to avoid the bad parts
Javascript - How to avoid the bad partsJavascript - How to avoid the bad parts
Javascript - How to avoid the bad partsMikko Ohtamaa
 
잘 알려지지 않은 Php 코드 활용하기
잘 알려지지 않은 Php 코드 활용하기잘 알려지지 않은 Php 코드 활용하기
잘 알려지지 않은 Php 코드 활용하기형우 안
 
永不止步的“重构”
永不止步的“重构”永不止步的“重构”
永不止步的“重构”Kejun Zhang
 
A Tour Through the Groovy Ecosystem
A Tour Through the Groovy EcosystemA Tour Through the Groovy Ecosystem
A Tour Through the Groovy EcosystemLeonard Axelsson
 
让开发也懂前端
让开发也懂前端让开发也懂前端
让开发也懂前端lifesinger
 
Parser combinators
Parser combinatorsParser combinators
Parser combinatorslifecoder
 
Clouds against the Floods (RubyConfBR2011)
Clouds against the Floods (RubyConfBR2011) Clouds against the Floods (RubyConfBR2011)
Clouds against the Floods (RubyConfBR2011) Leonardo Borges
 
Ext GWT 3.0 Advanced Templates
Ext GWT 3.0 Advanced TemplatesExt GWT 3.0 Advanced Templates
Ext GWT 3.0 Advanced TemplatesSencha
 
HTML5 & CSS3 in Drupal (on the Bayou)
HTML5 & CSS3 in Drupal (on the Bayou)HTML5 & CSS3 in Drupal (on the Bayou)
HTML5 & CSS3 in Drupal (on the Bayou)Mediacurrent
 
Rails ORM De-mystifying Active Record has_many
Rails ORM De-mystifying Active Record has_manyRails ORM De-mystifying Active Record has_many
Rails ORM De-mystifying Active Record has_manyBlazing Cloud
 
The Solar Framework for PHP
The Solar Framework for PHPThe Solar Framework for PHP
The Solar Framework for PHPConFoo
 
The Easy Way - Plone Conference 2011
The Easy Way - Plone Conference 2011The Easy Way - Plone Conference 2011
The Easy Way - Plone Conference 2011Mikko Ohtamaa
 

Semelhante a Web Scraping using Diazo! (20)

Semantic chop
Semantic chopSemantic chop
Semantic chop
 
What's new in HTML5, CSS3 and JavaScript, James Pearce
What's new in HTML5, CSS3 and JavaScript, James PearceWhat's new in HTML5, CSS3 and JavaScript, James Pearce
What's new in HTML5, CSS3 and JavaScript, James Pearce
 
Moving to Dojo 1.7 and the path to 2.0
Moving to Dojo 1.7 and the path to 2.0Moving to Dojo 1.7 and the path to 2.0
Moving to Dojo 1.7 and the path to 2.0
 
international PHP2011_ilia alshanetsky_Hidden Features of PHP
international PHP2011_ilia alshanetsky_Hidden Features of PHPinternational PHP2011_ilia alshanetsky_Hidden Features of PHP
international PHP2011_ilia alshanetsky_Hidden Features of PHP
 
JS Tooling in Rails 3.1
JS Tooling in Rails 3.1JS Tooling in Rails 3.1
JS Tooling in Rails 3.1
 
Javascript - How to avoid the bad parts
Javascript - How to avoid the bad partsJavascript - How to avoid the bad parts
Javascript - How to avoid the bad parts
 
잘 알려지지 않은 Php 코드 활용하기
잘 알려지지 않은 Php 코드 활용하기잘 알려지지 않은 Php 코드 활용하기
잘 알려지지 않은 Php 코드 활용하기
 
Sequel @ madrid-rb
Sequel @  madrid-rbSequel @  madrid-rb
Sequel @ madrid-rb
 
永不止步的“重构”
永不止步的“重构”永不止步的“重构”
永不止步的“重构”
 
A Tour Through the Groovy Ecosystem
A Tour Through the Groovy EcosystemA Tour Through the Groovy Ecosystem
A Tour Through the Groovy Ecosystem
 
让开发也懂前端
让开发也懂前端让开发也懂前端
让开发也懂前端
 
Parser combinators
Parser combinatorsParser combinators
Parser combinators
 
Xml
XmlXml
Xml
 
Clouds against the Floods (RubyConfBR2011)
Clouds against the Floods (RubyConfBR2011) Clouds against the Floods (RubyConfBR2011)
Clouds against the Floods (RubyConfBR2011)
 
Ext GWT 3.0 Advanced Templates
Ext GWT 3.0 Advanced TemplatesExt GWT 3.0 Advanced Templates
Ext GWT 3.0 Advanced Templates
 
HTML5 & CSS3 in Drupal (on the Bayou)
HTML5 & CSS3 in Drupal (on the Bayou)HTML5 & CSS3 in Drupal (on the Bayou)
HTML5 & CSS3 in Drupal (on the Bayou)
 
Rails ORM De-mystifying Active Record has_many
Rails ORM De-mystifying Active Record has_manyRails ORM De-mystifying Active Record has_many
Rails ORM De-mystifying Active Record has_many
 
The Solar Framework for PHP
The Solar Framework for PHPThe Solar Framework for PHP
The Solar Framework for PHP
 
The Easy Way - Plone Conference 2011
The Easy Way - Plone Conference 2011The Easy Way - Plone Conference 2011
The Easy Way - Plone Conference 2011
 
Xml presentation
Xml presentationXml presentation
Xml presentation
 

Último

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Último (20)

Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

Web Scraping using Diazo!