Web Scraping in Ruby*
 * for fun and profit
2 of the 6 W’s


Who?
Why?
Why Ruby?
Fun
  5.times { print "I like scraping in Ruby" }
OOP
Pretty
Closures
Flexible
Interactive mode
Why Ruby?

Community
Culture of testing
Rails is a web app framework
  test culture + web app = web testing
Lots of libraries!
Why Scrape the Web?
Information
Research
Testing
  acceptance & integration
  performance / load
Standard API: HTTP + text
Why not?
IANAL
Legal Concerns
Copyright
  Online != Public Domain
  Fair Use
License
  AUP, EULA, TOS, TOU ...
  Example: www.dexknows.com
Trespass to chattels
Objective


Get data. Have fun. Be nice.
Request
GET / HTTP/1.1
Host: www.google.com

Response
HTTP/1.1 200 OK
<html>
  <head>
     <title>My page</title>
  </head>
  <body>
     <p>da body</p>
  </body>
</html>
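The exchange above is plain text, so it can be pulled apart with nothing but string methods. A minimal sketch, assuming a raw response that has already been read into a string; `parse_http_response` is a hypothetical helper, and the `Content-Type` header is added here for illustration only:

```ruby
# Hypothetical helper: split a raw HTTP response into status, headers, and body.
def parse_http_response(raw)
  # A blank line separates the status line and headers from the body.
  head, body = raw.split(/\r?\n\r?\n/, 2)
  status_line, *header_lines = head.split(/\r?\n/)
  _version, code, reason = status_line.split(' ', 3)
  headers = header_lines.to_h do |line|
    name, value = line.split(':', 2)
    [name.strip.downcase, value.strip]
  end
  { code: code.to_i, reason: reason, headers: headers, body: body.to_s }
end

# What the client sends (shown for reference; nothing goes over the network).
RAW_REQUEST = "GET / HTTP/1.1\r\nHost: www.google.com\r\n\r\n"

# What the server might send back (the Content-Type header is an assumption).
RAW_RESPONSE = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n" \
               "<html><head><title>My page</title></head>" \
               "<body><p>da body</p></body></html>"

response = parse_http_response(RAW_RESPONSE)
puts response[:code]
puts response[:headers]['content-type']
```

In real code net/http does this parsing for you; the sketch only shows that there is no magic underneath.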
HTTP Status Codes

200 level - Success on client and server
300 level - Redirection - client is supposed to do something else
400 level - Client error
500 level - Server error
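A scraper usually branches on these levels. A minimal sketch, assuming only the four levels listed above; `status_class` and `retry?` are hypothetical helpers, and retrying only on server errors is an assumed policy, not a rule:

```ruby
# Hypothetical helper: name the level of an HTTP status code.
def status_class(code)
  case code
  when 200..299 then :success
  when 300..399 then :redirection
  when 400..499 then :client_error
  when 500..599 then :server_error
  else :other # anything outside the four levels on the slide
  end
end

# Assumed policy: a client error is our fault, so only server errors get a retry.
def retry?(code)
  status_class(code) == :server_error
end

[200, 301, 404, 503].each do |code|
  puts "#{code}: #{status_class(code)}"
end
```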
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">

<html lang="en">
<head>

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  <title>All about Joe</title>
</head>
<body>
  <div id="header"> | <span class="nav-link">home</span> | </div>
  <div id="content">
    <div id="sidebar">I’m a paragraph</div>
    <div id="body">
       <h1>I’m a header</h1>
       <p class="bio">
         <img src="/img/mugshot.png">
         <span class="name">Joe Blow</span>
       </p>
       <p>Info about Joe</p>
    </div>
  </div>
  <div id="footer">&copy; 2009 Creative Commons</div>
</body>
</html>
CSS & XPath Selectors
Way to target data inside a document
CSS
  "head title"
  "div#body * span.name"
XPath
  /html/head/title
  //div[@id='body']/*/span[@class='name']
Selecting XPath Nodes

  title             selects child nodes named "title" of the current node
  /html             selects from the root node
  //title           selects all "title" nodes anywhere in the doc
  [@src]            selects nodes that have the attribute
  [@class='name']   selects nodes whose attribute has the given value
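To try these expressions without installing anything, here is a minimal sketch using REXML, which ships with Ruby, instead of Hpricot or Nokogiri. REXML needs well-formed XML, so `PAGE` is a trimmed, self-closed cut of the "All about Joe" page, made up for this example:

```ruby
require 'rexml/document'

# Trimmed, well-formed version of the sample page (<img> is self-closed).
PAGE = <<~HTML
  <html lang="en">
    <head><title>All about Joe</title></head>
    <body>
      <div id="header"><span class="nav-link">home</span></div>
      <div id="content">
        <div id="body">
          <h1>A header</h1>
          <p class="bio">
            <img src="/img/mugshot.png"/>
            <span class="name">Joe Blow</span>
          </p>
          <p>Info about Joe</p>
        </div>
      </div>
    </body>
  </html>
HTML

doc = REXML::Document.new(PAGE)

# Absolute path from the root node.
title = REXML::XPath.first(doc, '/html/head/title').text

# Relative path: "title" is looked up among the children of <head>.
head = REXML::XPath.first(doc, '/html/head')
relative_title = REXML::XPath.first(head, 'title').text

# Attribute predicates, with and without a value.
name = REXML::XPath.first(doc, "//div[@id='body']/*/span[@class='name']").text
img_srcs = REXML::XPath.match(doc, '//img[@src]').map { |img| img.attributes['src'] }

# "//" finds matching nodes anywhere in the document.
span_count = REXML::XPath.match(doc, '//span').size

puts title, name, img_srcs.inspect, span_count
```

Real pages are rarely well-formed, which is why the forgiving parsers (Hpricot, Nokogiri) are the usual choice; the XPath strings themselves carry over unchanged.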
Browser Tools

Firefox
  QuarkRuby’s version of Firebug
    click to get CSS & XPath Selectors
  Firefinder extension to Firebug
    query doc with CSS Selectors
Interaction & Parsing Tools
Interaction
  net/http
  Watir family
Parsing
  regex
  hpricot & nokogiri
Interaction + Parsing
  Mechanize
  webrat
  scrubyt
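At the pure-parsing end, a regex is the quickest thing that can work. A minimal sketch, assuming the HTML is already in a string; `SAMPLE_HTML` extends the earlier "My page" response with two made-up links:

```ruby
# Made-up sample: the "My page" document plus two hypothetical links.
SAMPLE_HTML = '<html><head><title>My page</title></head>' \
              '<body><p>da body</p>' \
              '<a href="/one">One</a> <a href="/two">Two</a></body></html>'

# String#[] with a regex and a group index returns just the captured text.
title = SAMPLE_HTML[%r{<title>(.*?)</title>}m, 1]

# String#scan returns one [href, text] pair per match.
links = SAMPLE_HTML.scan(%r{<a\s+href="([^"]*)"[^>]*>(.*?)</a>}m)

puts title
links.each { |href, text| puts "#{text} -> #{href}" }
```

This breaks as soon as attribute order, quoting, or nesting changes, which is the argument for reaching for a real parser once the page gets more complicated.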
Scrubyt
http://github.com/scrubber/scrubyt_examples/blob/master/google.rb


require 'rubygems'
require 'scrubyt'
 
google_data = Scrubyt::Extractor.define do
  fetch 'http://www.google.com/search?hl=en&q=ruby'
  
  link_title "//a[@class='l']", :write_text => true do
    link_url
  end
end
 
p google_data.to_hash
Watir (Safariwatir)
require 'safariwatir'

# Drive a real Safari window
browser = Watir::Safari.new
browser.goto("http://google.com")
browser.text_field(:name, "q").set("safariwatir")
# btnI is Google's "I'm Feeling Lucky" button
browser.button(:name, "btnI").click
puts "FAILURE" unless browser.contains_text("software")
Webrat Scraper
require 'webrat_scraper'

class MyScraper < WebratScraper
  def initialize
    @url = "http://www.google.com"
    super
  end

  # Search for a term and return the first result's text and URL
  def first_result_for(search_term)
    visit @url

    fill_in "q", :with => search_term
    click_button

    first_link = (doc/"li.g a.l").first

    { :text => first_link.inner_text,
      :url  => first_link.attributes["href"].to_s }
  end
end

m = MyScraper.new
result = m.first_result_for("webrat-mechanize")
puts result.inspect
Resources
Ruby http://ruby-lang.org

HTTP Status Codes http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html

CSS Selectors http://www.w3schools.com/Css/css_syntax.asp

XPath http://www.w3schools.com/XPath/xpath_syntax.asp

Mechanize http://mechanize.rubyforge.org/mechanize/

Nokogiri http://nokogiri.rubyforge.org/nokogiri/

Hpricot http://github.com/whymirror/hpricot

Webrat http://wiki.github.com/brynary/webrat

Scrubyt http://scrubyt.org/

Webrat Scraper http://github.com/jtzemp/webrat-scraper

TourBus http://github.com/dbrady/tourbus
Advanced Topics
Distributed Scraping
Anonymization
Security
  CAPTCHA
  CSRF (XSRF) protections
Load and Performance Testing

Web Scraping In Ruby Utosc 2009.Key


Editor's Notes

  1. Civil Law, Contract Law & Tort Law
     Copyright is civil law with a lot of case law & precedent
     TOS, AUP, etc. are contract law
     Trespass to chattels is tort law - have to 'damage'
     DexKnows: look at TOU for automated scraping, robots.txt and the sitemap
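
Checking robots.txt before fetching is one concrete way to be nice. A minimal and deliberately simplified sketch: it honours only `User-agent: *` groups and plain `Disallow` prefixes (no `Allow` rules, wildcards, or crawl delays). `disallowed_prefixes`, `allowed?`, and the `ROBOTS` sample are hypothetical and not taken from any real site:

```ruby
# Hypothetical, simplified parser: collect Disallow prefixes that apply to
# every crawler ("User-agent: *"). Real robots.txt handling has more rules.
def disallowed_prefixes(robots_txt)
  applies = false
  robots_txt.each_line.with_object([]) do |line, prefixes|
    # Drop comments, then split "Field: value".
    field, value = line.sub(/#.*/, '').strip.split(':', 2).map(&:strip)
    next if value.nil?

    case field.downcase
    when 'user-agent' then applies = (value == '*')
    when 'disallow' then prefixes << value if applies && !value.empty?
    end
  end
end

# A path is allowed when no collected prefix matches its start.
def allowed?(robots_txt, path)
  disallowed_prefixes(robots_txt).none? { |prefix| path.start_with?(prefix) }
end

# Made-up sample file for illustration.
ROBOTS = <<~TXT
  User-agent: *
  Disallow: /search
  Disallow: /private/   # keep out

  User-agent: BadBot
  Disallow: /
TXT

puts allowed?(ROBOTS, '/about')
puts allowed?(ROBOTS, '/search?q=ruby')
```

Passing this check says nothing about the site's TOU; it only covers what the site asks of crawlers in robots.txt.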