SlideShare uma empresa Scribd logo
1 de 40
Web Scraping with PHP Matthew Turland September 16, 2008
Everyone acquainted? ,[object Object],[object Object],[object Object]
What is Web Scraping? 2 Stage Process Stage 1 : Retrieval GET  /some/resource ... HTTP/1.1 200  OK ... Resource with data  you want Stage 2 : Analysis Raw resource Usable data
How is it different from... Data Mining Focus in data mining Focus in web scraping Consuming Web Services Web service data formats Web scraping data formats
Potential Applications What Data source When Web service is unavailable or data  access is one-time only. Crawlers and indexers Remote data search offers no  capabilities for search or data source  integration. Integration testing Applications must be tested by  simulating client behavior and  ensuring responses are consistent with requirements.
Disadvantages Δ == vs
Legal Concerns TOS TOU EUA Original source Illegal syndicate IANAL!
Retrieval GET  /some/resource ... sizeof( ) == sizeof( ) if( ) require ;
The Low-Down on HTTP
Know enough HTTP to... Use one like this: To do this:
Know enough HTTP to... PEAR::HTTP_Client pecl_http Zend_Http_Client Learn to use and troubleshoot one like this: Or roll your own! cURL Filesystem  +  Streams
[object Object],[object Object],Let's GET Started method  or  operation URI  address for the  desired  resource protocol version  in  use by the client header  name header  value request line header more headers follow...
Warning about GET In principle: "Let's do this by the book." GET In reality: "' Safe operation '? Whatever." GET
URI vs URL 1. Uniquely identifies a resource 2. Indicates how to locate a resource 3. Does both and is thus human-usable. URI URL More info in  RFC 3986  Sections  1.1.3  and  1.2.2
Query Strings http://en.wikipedia.org/w/index.php? title=Query_string&action=edit URL Query String Question mark  to separate the resource address and  query string Equal signs  to separate parameter names and respective values Ampersands  to separate  parameter  name-value pairs. Parameter Value
URL Encoding Parameter Value first second this is a field was it clear enough (already)? Query String first=this+is+a+field&second=was+it+clear+%28already%29%3F Also called  percent encoding . urlencode  and  urlencode : Handy  PHP URL functions $_SERVER ['QUERY_STRING'] /  http_build_query ( $_GET ) More info on URL encoding in  RFC 3986   Section 2.1
POST Requests Most Common HTTP Operations 1. GET 2. POST ... /w/index.php POST /new/resource -or- /updated/resource GET /some/resource HTTP/1.1 Header: Value ... POST /some/resource HTTP/1.1 Header: Value request body none
POST Request Example ,[object Object],[object Object],[object Object],Blank line  separates request headers and body Content type  for data submitted via HTML form (multipart/form-data for  file uploads ) Request body ... look familiar? Note : Most browsers have a query string length limit. Lowest known common denominator: IE7 – strlen(entire URL) <= 2,048 bytes. This limit is not standardized and only  applies to query strings, not request bodies.
HEAD Request ,[object Object],Same as GET with two exceptions: 1 ,[object Object],2 No response body HEAD vs GET Sometimes headers are all you want ?
Responses HTTP/1.0 200 OK Server: Apache X-Powered-By: PHP/5.2.5 ... [body] Lowest  protocol version required to process the response Response status code Response status description Status line Same header format as requests, but different  headers are used (see  RFC 2616 Section 14 )
Response Status Codes ,[object Object],[object Object],[object Object],[object Object],[object Object],See  RFC 2616 Section 10  for more info.
Headers Set-Cookie Cookie Location Watch out for  infinite loops! Last-Modified If-Modified-Since 304 Not Modified ETag If-None-Match OR See  RFC 2109  or  RFC 2965 for more info.
More Headers WWW-Authenticate Authorization User-Agent 200 OK / 403 Forbidden See  RFC 2617 for more info. User-Agent: Some servers perform user agent sniffing Some clients perform user agent spoofing
Best Practices ,[object Object],[object Object],[object Object]
Simple Streams Examples $uri = 'http://www.example.com/some/resource'; $get =  file_get_contents ($uri); $context =  stream_context_create (array('http' => array( 'method' => 'POST', 'header' => 'Content-Type: ' . 'application/x-www-form-urlencoded', 'content' => http_build_query(array( 'var1' => 'value1', 'var2' => 'value2' )) ))); // Last 2 parameters here also apply to fopen() $post =  file_get_contents ($uri, false, $context);
Streams Resources ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
pecl_http Examples $http = new  HttpRequest ($uri); $http-> enableCookies (); $http-> setMethod (HTTP_METH_POST); // or HTTP_METH_GET $http-> addPostFields ($postData); $http-> setOptions (array( 'httpauth' => $username . ':' . $password, 'httpauthtype' => HTTP_AUTH_BASIC, ‘ useragent’ => 'PHP ' . phpversion(), 'referer' => 'http://example.com/some/referer', 'range' => array(array(1, 5), array(10, 15)) )); $response = $http-> send (); $headers = $response-> getHeaders (); $body = $response-> getBody (); See  PHP Manual for more info.
PEAR::HTTP_Client Examples $cookiejar = new  HTTP_Client_CookieManager (); $request = new  HTTP_Request ($uri); $request-> setMethod (HTTP_REQUEST_METHOD_POST); $request-> setBasicAuth ($username, $password); $request-> addHeader ('User-Agent', $userAgent); $request-> addHeader ('Referer', $referrer); $request-> addHeader ('Range', 'bytes=2-3,5-6'); foreach ($postData as $key => $value) $request-> addPostData ($key, $value); $request-> sendRequest (); $cookiejar-> updateCookies ($request); $request = new  HTTP_Request ($otheruri); $cookiejar-> passCookies ($request); $response = $request-> sendRequest (); $headers = $request->getResponseHeader(); $body = $request->getResponseBody(); See  PEAR Manual   and  API Docs   for more info.
Zend_Http_Client Examples $client = new  Zend_Http_Client ($uri); $client-> setMethod (Zend_Http_Client::POST); $client-> setAuth ($username, $password); $client-> setHeaders ('User-Agent', $userAgent); $client-> setHeaders (array( 'Referer' => $referrer, 'Range' => 'bytes=2-3,5-6' ); $client-> setParameterPost ($postData); $client-> setCookieJar (); $client-> request (); $client-> setUri ($otheruri); $client-> setMethod (Zend_Http_Client::GET); $response = $client-> request (); $headers = $response-> getHeaders (); $body = $response-> getBody (); See  ZF Manual for more info.
cURL Examples Fatal error: Allowed memory size of n00b bytes  exhausted (tried to allocate 1337 bytes) in  /this/slide.php on line 1 See  PHP Manual ,  Context Options , or  my php|architect article   for more info. Just kidding. Really, the equivalent cURL code for the  previous examples is so verbose that it won't fit on one slide and I don't think it's deserving of multiple slides.
HTTP Resources ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Analysis Raw resource Usable data DOM XMLReader SimpleXML XSL tidy PCRE String functions JSON ctype XML Parser
Cleanup ,[object Object],[object Object],[object Object],[object Object]
Parsing ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Validation ,[object Object],[object Object],[object Object],[object Object]
Transformation ,[object Object],[object Object],[object Object]
Abstraction ,[object Object],[object Object],[object Object]
Assertions ,[object Object],[object Object],[object Object],[object Object]
Testing ,[object Object],[object Object],[object Object]
Questions? ,[object Object],[object Object],[object Object],[object Object]

Mais conteúdo relacionado

Mais procurados

05 File Handling Upload Mysql
05 File Handling Upload Mysql05 File Handling Upload Mysql
05 File Handling Upload MysqlGeshan Manandhar
 
Php Simple Xml
Php Simple XmlPhp Simple Xml
Php Simple Xmlmussawir20
 
Session Server - Maintaing State between several Servers
Session Server - Maintaing State between several ServersSession Server - Maintaing State between several Servers
Session Server - Maintaing State between several ServersStephan Schmidt
 
DrupalCon Chicago Practical MongoDB and Drupal
DrupalCon Chicago Practical MongoDB and DrupalDrupalCon Chicago Practical MongoDB and Drupal
DrupalCon Chicago Practical MongoDB and DrupalDoug Green
 
PHP And Web Services: Perfect Partners
PHP And Web Services: Perfect PartnersPHP And Web Services: Perfect Partners
PHP And Web Services: Perfect PartnersLorna Mitchell
 
12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocratJonathan Linowes
 
RESTful SOA - 中科院暑期讲座
RESTful SOA - 中科院暑期讲座RESTful SOA - 中科院暑期讲座
RESTful SOA - 中科院暑期讲座Li Yi
 
Intro to php
Intro to phpIntro to php
Intro to phpSp Singh
 
Modern Web Development with Perl
Modern Web Development with PerlModern Web Development with Perl
Modern Web Development with PerlDave Cross
 
Cakefest 2010: API Development
Cakefest 2010: API DevelopmentCakefest 2010: API Development
Cakefest 2010: API DevelopmentAndrew Curioso
 
Open Source Package PHP & MySQL
Open Source Package PHP & MySQLOpen Source Package PHP & MySQL
Open Source Package PHP & MySQLkalaisai
 
Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5Stephan Schmidt
 
XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARStephan Schmidt
 
Class 6 - PHP Web Programming
Class 6 - PHP Web ProgrammingClass 6 - PHP Web Programming
Class 6 - PHP Web ProgrammingAhmed Swilam
 

Mais procurados (20)

05 File Handling Upload Mysql
05 File Handling Upload Mysql05 File Handling Upload Mysql
05 File Handling Upload Mysql
 
Php
PhpPhp
Php
 
Advanced Json
Advanced JsonAdvanced Json
Advanced Json
 
Php Simple Xml
Php Simple XmlPhp Simple Xml
Php Simple Xml
 
Session Server - Maintaing State between several Servers
Session Server - Maintaing State between several ServersSession Server - Maintaing State between several Servers
Session Server - Maintaing State between several Servers
 
DrupalCon Chicago Practical MongoDB and Drupal
DrupalCon Chicago Practical MongoDB and DrupalDrupalCon Chicago Practical MongoDB and Drupal
DrupalCon Chicago Practical MongoDB and Drupal
 
PHP And Web Services: Perfect Partners
PHP And Web Services: Perfect PartnersPHP And Web Services: Perfect Partners
PHP And Web Services: Perfect Partners
 
12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat12 core technologies you should learn, love, and hate to be a 'real' technocrat
12 core technologies you should learn, love, and hate to be a 'real' technocrat
 
RESTful SOA - 中科院暑期讲座
RESTful SOA - 中科院暑期讲座RESTful SOA - 中科院暑期讲座
RESTful SOA - 中科院暑期讲座
 
Intro to php
Intro to phpIntro to php
Intro to php
 
Intro to PHP
Intro to PHPIntro to PHP
Intro to PHP
 
Modern Web Development with Perl
Modern Web Development with PerlModern Web Development with Perl
Modern Web Development with Perl
 
PHP POWERPOINT SLIDES
PHP POWERPOINT SLIDESPHP POWERPOINT SLIDES
PHP POWERPOINT SLIDES
 
Cakefest 2010: API Development
Cakefest 2010: API DevelopmentCakefest 2010: API Development
Cakefest 2010: API Development
 
Open Source Package PHP & MySQL
Open Source Package PHP & MySQLOpen Source Package PHP & MySQL
Open Source Package PHP & MySQL
 
Copy of cgi
Copy of cgiCopy of cgi
Copy of cgi
 
Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5Go OO! - Real-life Design Patterns in PHP 5
Go OO! - Real-life Design Patterns in PHP 5
 
XML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEARXML and Web Services with PHP5 and PEAR
XML and Web Services with PHP5 and PEAR
 
PEAR For The Masses
PEAR For The MassesPEAR For The Masses
PEAR For The Masses
 
Class 6 - PHP Web Programming
Class 6 - PHP Web ProgrammingClass 6 - PHP Web Programming
Class 6 - PHP Web Programming
 

Semelhante a Web Scraping with PHP

Ellerslie User Group - ReST Presentation
Ellerslie User Group - ReST PresentationEllerslie User Group - ReST Presentation
Ellerslie User Group - ReST PresentationAlex Henderson
 
Spring Boot and REST API
Spring Boot and REST APISpring Boot and REST API
Spring Boot and REST API07.pallav
 
01. http basics v27
01. http basics v2701. http basics v27
01. http basics v27Eoin Keary
 
KMUTNB - Internet Programming 2/7
KMUTNB - Internet Programming 2/7KMUTNB - Internet Programming 2/7
KMUTNB - Internet Programming 2/7phuphax
 
Resource-Oriented Web Services
Resource-Oriented Web ServicesResource-Oriented Web Services
Resource-Oriented Web ServicesBradley Holt
 
RESTful Web Services with JAX-RS
RESTful Web Services with JAX-RSRESTful Web Services with JAX-RS
RESTful Web Services with JAX-RSCarol McDonald
 
Advanced SEO for Web Developers
Advanced SEO for Web DevelopersAdvanced SEO for Web Developers
Advanced SEO for Web DevelopersNathan Buggia
 
Chapter 1.Web Techniques_Notes.pptx
Chapter 1.Web Techniques_Notes.pptxChapter 1.Web Techniques_Notes.pptx
Chapter 1.Web Techniques_Notes.pptxShitalGhotekar
 
Spring MVC 3 Restful
Spring MVC 3 RestfulSpring MVC 3 Restful
Spring MVC 3 Restfulknight1128
 
Rest with Java EE 6 , Security , Backbone.js
Rest with Java EE 6 , Security , Backbone.jsRest with Java EE 6 , Security , Backbone.js
Rest with Java EE 6 , Security , Backbone.jsCarol McDonald
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalystdwm042
 
REST-API introduction for developers
REST-API introduction for developersREST-API introduction for developers
REST-API introduction for developersPatrick Savalle
 

Semelhante a Web Scraping with PHP (20)

Ellerslie User Group - ReST Presentation
Ellerslie User Group - ReST PresentationEllerslie User Group - ReST Presentation
Ellerslie User Group - ReST Presentation
 
Spring Boot and REST API
Spring Boot and REST APISpring Boot and REST API
Spring Boot and REST API
 
01. http basics v27
01. http basics v2701. http basics v27
01. http basics v27
 
HTTP Basics Demo
HTTP Basics DemoHTTP Basics Demo
HTTP Basics Demo
 
KMUTNB - Internet Programming 2/7
KMUTNB - Internet Programming 2/7KMUTNB - Internet Programming 2/7
KMUTNB - Internet Programming 2/7
 
Resource-Oriented Web Services
Resource-Oriented Web ServicesResource-Oriented Web Services
Resource-Oriented Web Services
 
Web services tutorial
Web services tutorialWeb services tutorial
Web services tutorial
 
Solr Presentation
Solr PresentationSolr Presentation
Solr Presentation
 
Web Services Tutorial
Web Services TutorialWeb Services Tutorial
Web Services Tutorial
 
Switch to Backend 2023
Switch to Backend 2023Switch to Backend 2023
Switch to Backend 2023
 
Coding In Php
Coding In PhpCoding In Php
Coding In Php
 
RESTful Web Services with JAX-RS
RESTful Web Services with JAX-RSRESTful Web Services with JAX-RS
RESTful Web Services with JAX-RS
 
Advanced SEO for Web Developers
Advanced SEO for Web DevelopersAdvanced SEO for Web Developers
Advanced SEO for Web Developers
 
Chapter 1.Web Techniques_Notes.pptx
Chapter 1.Web Techniques_Notes.pptxChapter 1.Web Techniques_Notes.pptx
Chapter 1.Web Techniques_Notes.pptx
 
Spring MVC 3 Restful
Spring MVC 3 RestfulSpring MVC 3 Restful
Spring MVC 3 Restful
 
Rest with Java EE 6 , Security , Backbone.js
Rest with Java EE 6 , Security , Backbone.jsRest with Java EE 6 , Security , Backbone.js
Rest with Java EE 6 , Security , Backbone.js
 
Rest
RestRest
Rest
 
Practical catalyst
Practical catalystPractical catalyst
Practical catalyst
 
REST-API introduction for developers
REST-API introduction for developersREST-API introduction for developers
REST-API introduction for developers
 
Cqrs api v2
Cqrs api v2Cqrs api v2
Cqrs api v2
 

Mais de Matthew Turland

New SPL Features in PHP 5.3
New SPL Features in PHP 5.3New SPL Features in PHP 5.3
New SPL Features in PHP 5.3Matthew Turland
 
New SPL Features in PHP 5.3 (TEK-X)
New SPL Features in PHP 5.3 (TEK-X)New SPL Features in PHP 5.3 (TEK-X)
New SPL Features in PHP 5.3 (TEK-X)Matthew Turland
 
Open Source Networking with Vyatta
Open Source Networking with VyattaOpen Source Networking with Vyatta
Open Source Networking with VyattaMatthew Turland
 
Open Source Content Management Systems
Open Source Content Management SystemsOpen Source Content Management Systems
Open Source Content Management SystemsMatthew Turland
 
PHP Basics for Designers
PHP Basics for DesignersPHP Basics for Designers
PHP Basics for DesignersMatthew Turland
 
Creating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew TurlandCreating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew TurlandMatthew Turland
 
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake DevilleThe OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake DevilleMatthew Turland
 
Utilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan FusilierUtilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan FusilierMatthew Turland
 
The Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan FarnellThe Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan FarnellMatthew Turland
 
PDQ Programming Languages plus an overview of Alice - Frank Ducrest
PDQ Programming Languages plus an overview of Alice - Frank DucrestPDQ Programming Languages plus an overview of Alice - Frank Ducrest
PDQ Programming Languages plus an overview of Alice - Frank DucrestMatthew Turland
 
Getting Involved in Open Source - Matthew Turland
Getting Involved in Open Source - Matthew TurlandGetting Involved in Open Source - Matthew Turland
Getting Involved in Open Source - Matthew TurlandMatthew Turland
 

Mais de Matthew Turland (13)

New SPL Features in PHP 5.3
New SPL Features in PHP 5.3New SPL Features in PHP 5.3
New SPL Features in PHP 5.3
 
New SPL Features in PHP 5.3 (TEK-X)
New SPL Features in PHP 5.3 (TEK-X)New SPL Features in PHP 5.3 (TEK-X)
New SPL Features in PHP 5.3 (TEK-X)
 
Sinatra
SinatraSinatra
Sinatra
 
Web Scraping with PHP
Web Scraping with PHPWeb Scraping with PHP
Web Scraping with PHP
 
Open Source Networking with Vyatta
Open Source Networking with VyattaOpen Source Networking with Vyatta
Open Source Networking with Vyatta
 
Open Source Content Management Systems
Open Source Content Management SystemsOpen Source Content Management Systems
Open Source Content Management Systems
 
PHP Basics for Designers
PHP Basics for DesignersPHP Basics for Designers
PHP Basics for Designers
 
Creating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew TurlandCreating Web Services with Zend Framework - Matthew Turland
Creating Web Services with Zend Framework - Matthew Turland
 
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake DevilleThe OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
The OpenSolaris Operating System and Sun xVM VirtualBox - Blake Deville
 
Utilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan FusilierUtilizing the Xen Hypervisor in business practice - Bryan Fusilier
Utilizing the Xen Hypervisor in business practice - Bryan Fusilier
 
The Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan FarnellThe Ruby Programming Language - Ryan Farnell
The Ruby Programming Language - Ryan Farnell
 
PDQ Programming Languages plus an overview of Alice - Frank Ducrest
PDQ Programming Languages plus an overview of Alice - Frank DucrestPDQ Programming Languages plus an overview of Alice - Frank Ducrest
PDQ Programming Languages plus an overview of Alice - Frank Ducrest
 
Getting Involved in Open Source - Matthew Turland
Getting Involved in Open Source - Matthew TurlandGetting Involved in Open Source - Matthew Turland
Getting Involved in Open Source - Matthew Turland
 

Último

CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 

Último (20)

CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 

Web Scraping with PHP

  • 1. Web Scraping with PHP Matthew Turland September 16, 2008
  • 2.
  • 3. What is Web Scraping? 2 Stage Process Stage 1 : Retrieval GET /some/resource ... HTTP/1.1 200 OK ... Resource with data you want Stage 2 : Analysis Raw resource Usable data
  • 4. How is it different from... Data Mining Focus in data mining Focus in web scraping Consuming Web Services Web service data formats Web scraping data formats
  • 5. Potential Applications What Data source When Web service is unavailable or data access is one-time only. Crawlers and indexers Remote data search offers no capabilities for search or data source integration. Integration testing Applications must be tested by simulating client behavior and ensuring responses are consistent with requirements.
  • 7. Legal Concerns TOS TOU EUA Original source Illegal syndicate IANAL!
  • 8. Retrieval GET /some/resource ... sizeof( ) == sizeof( ) if( ) require ;
  • 10. Know enough HTTP to... Use one like this: To do this:
  • 11. Know enough HTTP to... PEAR::HTTP_Client pecl_http Zend_Http_Client Learn to use and troubleshoot one like this: Or roll your own! cURL Filesystem + Streams
  • 12.
  • 13. Warning about GET In principle: &quot;Let's do this by the book.&quot; GET In reality: &quot;' Safe operation '? Whatever.&quot; GET
  • 14. URI vs URL 1. Uniquely identifies a resource 2. Indicates how to locate a resource 3. Does both and is thus human-usable. URI URL More info in RFC 3986 Sections 1.1.3 and 1.2.2
  • 15. Query Strings http://en.wikipedia.org/w/index.php? title=Query_string&action=edit URL Query String Question mark to separate the resource address and query string Equal signs to separate parameter names and respective values Ampersands to separate parameter name-value pairs. Parameter Value
  • 16. URL Encoding Parameter Value first second this is a field was it clear enough (already)? Query String first=this+is+a+field&second=was+it+clear+%28already%29%3F Also called percent encoding . urlencode and urlencode : Handy PHP URL functions $_SERVER ['QUERY_STRING'] / http_build_query ( $_GET ) More info on URL encoding in RFC 3986 Section 2.1
  • 17. POST Requests Most Common HTTP Operations 1. GET 2. POST ... /w/index.php POST /new/resource -or- /updated/resource GET /some/resource HTTP/1.1 Header: Value ... POST /some/resource HTTP/1.1 Header: Value request body none
  • 18.
  • 19.
  • 20. Responses HTTP/1.0 200 OK Server: Apache X-Powered-By: PHP/5.2.5 ... [body] Lowest protocol version required to process the response Response status code Response status description Status line Same header format as requests, but different headers are used (see RFC 2616 Section 14 )
  • 21.
  • 22. Headers Set-Cookie Cookie Location Watch out for infinite loops! Last-Modified If-Modified-Since 304 Not Modified ETag If-None-Match OR See RFC 2109 or RFC 2965 for more info.
  • 23. More Headers WWW-Authenticate Authorization User-Agent 200 OK / 403 Forbidden See RFC 2617 for more info. User-Agent: Some servers perform user agent sniffing Some clients perform user agent spoofing
  • 24.
  • 25. Simple Streams Examples $uri = 'http://www.example.com/some/resource'; $get = file_get_contents ($uri); $context = stream_context_create (array('http' => array( 'method' => 'POST', 'header' => 'Content-Type: ' . 'application/x-www-form-urlencoded', 'content' => http_build_query(array( 'var1' => 'value1', 'var2' => 'value2' )) ))); // Last 2 parameters here also apply to fopen() $post = file_get_contents ($uri, false, $context);
  • 26.
  • 27. pecl_http Examples $http = new HttpRequest ($uri); $http-> enableCookies (); $http-> setMethod (HTTP_METH_POST); // or HTTP_METH_GET $http-> addPostFields ($postData); $http-> setOptions (array( 'httpauth' => $username . ':' . $password, 'httpauthtype' => HTTP_AUTH_BASIC, ‘ useragent’ => 'PHP ' . phpversion(), 'referer' => 'http://example.com/some/referer', 'range' => array(array(1, 5), array(10, 15)) )); $response = $http-> send (); $headers = $response-> getHeaders (); $body = $response-> getBody (); See PHP Manual for more info.
  • 28. PEAR::HTTP_Client Examples $cookiejar = new HTTP_Client_CookieManager (); $request = new HTTP_Request ($uri); $request-> setMethod (HTTP_REQUEST_METHOD_POST); $request-> setBasicAuth ($username, $password); $request-> addHeader ('User-Agent', $userAgent); $request-> addHeader ('Referer', $referrer); $request-> addHeader ('Range', 'bytes=2-3,5-6'); foreach ($postData as $key => $value) $request-> addPostData ($key, $value); $request-> sendRequest (); $cookiejar-> updateCookies ($request); $request = new HTTP_Request ($otheruri); $cookiejar-> passCookies ($request); $response = $request-> sendRequest (); $headers = $request->getResponseHeader(); $body = $request->getResponseBody(); See PEAR Manual and API Docs for more info.
  • 29. Zend_Http_Client Examples $client = new Zend_Http_Client ($uri); $client-> setMethod (Zend_Http_Client::POST); $client-> setAuth ($username, $password); $client-> setHeaders ('User-Agent', $userAgent); $client-> setHeaders (array( 'Referer' => $referrer, 'Range' => 'bytes=2-3,5-6' ); $client-> setParameterPost ($postData); $client-> setCookieJar (); $client-> request (); $client-> setUri ($otheruri); $client-> setMethod (Zend_Http_Client::GET); $response = $client-> request (); $headers = $response-> getHeaders (); $body = $response-> getBody (); See ZF Manual for more info.
  • 30. cURL Examples Fatal error: Allowed memory size of n00b bytes exhausted (tried to allocate 1337 bytes) in /this/slide.php on line 1 See PHP Manual , Context Options , or my php|architect article for more info. Just kidding. Really, the equivalent cURL code for the previous examples is so verbose that it won't fit on one slide and I don't think it's deserving of multiple slides.
  • 31.
  • 32. Analysis Raw resource Usable data DOM XMLReader SimpleXML XSL tidy PCRE String functions JSON ctype XML Parser
  • 33.
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
  • 39.
  • 40.