Crawling with NodeJS
JSMeetup2@Paris 24.11.2010
@sylvinus
Crawling?
Web crawling
Grab
Process
Store
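The three steps above can be sketched as a tiny pipeline. This is only an illustration, not code from the talk: `grabPage`, `processPage`, `storePage`, `crawl`, and the in-memory `pages` map are hypothetical names, and the "web" is faked so the sketch runs without network access.

```javascript
// Hypothetical in-memory "web" so the sketch runs without network access.
var pages = new Map([
  ["http://example.com/", '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'],
  ["http://example.com/a", '<a href="http://example.com/">home</a>'],
  ["http://example.com/b", "no links here"]
]);

// Grab: fetch the raw document for a URL (here, a lookup in the map above).
function grabPage(url) {
  return pages.has(url) ? pages.get(url) : "";
}

// Process: extract outgoing links. A regex is enough for this toy input;
// real HTML needs a proper parser.
function processPage(html) {
  var links = [];
  var re = /href="([^"]+)"/g;
  var match;
  while ((match = re.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Store: keep the result somewhere (here, a Map keyed by URL).
function storePage(db, url, links) {
  db.set(url, links);
}

// Breadth-first crawl: grab, process, store, then follow unseen links.
function crawl(startUrl) {
  var db = new Map();
  var seen = new Set([startUrl]);
  var queue = [startUrl];
  while (queue.length > 0) {
    var url = queue.shift();
    var links = processPage(grabPage(url));
    storePage(db, url, links);
    links.forEach(function (link) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    });
  }
  return db;
}

var crawled = crawl("http://example.com/");
console.log(Array.from(crawled.keys()));
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other.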
?
NodeJS
Server-side Javascript
Async / Event-driven / Reactor pattern
Small stdlib, Exploding module ecosystem
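As a rough illustration of the event-driven style, the sketch below uses Node's built-in `events.EventEmitter`. The `collectBody` helper and the `fakeResponse` object are hypothetical; the emitter only stands in for an HTTP response so that nothing here touches the network.

```javascript
var EventEmitter = require("events").EventEmitter;

// Hypothetical helper: gathers "data" chunks from any emitter and calls
// done(null, body) on "end", or done(err) on "error".
function collectBody(source, done) {
  var chunks = [];
  source.on("data", function (chunk) { chunks.push(chunk); });
  source.on("end", function () { done(null, chunks.join("")); });
  source.on("error", function (err) { done(err); });
}

// A plain emitter standing in for an HTTP response.
var fakeResponse = new EventEmitter();
var received = null;

collectBody(fakeResponse, function (err, body) {
  if (err) throw err;
  received = body;
  console.log("got " + body.length + " characters");
});

// Registering the handlers did not block; they run only when events fire.
fakeResponse.emit("data", "<html>");
fakeResponse.emit("data", "</html>");
fakeResponse.emit("end");
```

The code registers what should happen and returns immediately; the work happens later, whenever the events arrive.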
Why?
Boldly going where no one has g...
Threads vs. Async
ZOMG Server-side CSS3 selectors!
Apricot
https://github.com/silentrob/Apricot
HTML/DOM Parser, inspired by Hpricot
Sizzle + JSDOM + XUI
Problems w/ Apricot
if (file.match(/^https?:\/\//)) {
    var urlInfo = url.parse(file, parseQueryString=false),
    host = http.createClient(((urlInfo.protocol === 'http:') ? 80 : 443), urlInfo.hostname),
    req_url = urlInfo.pathname;
    if (urlInfo.search) {
      req_url += urlInfo.search;
    }
    var request = host.request('GET', req_url, { host: urlInfo.hostname });
    request.addListener('response', function (response) {
      var data = '';
      response.addListener('data', function (chunk) {
        data += chunk;
      });
      response.addListener("end", function() {
        fnLoaderHandle(null, data);
      });
    });
    if (request.end) {
      request.end();
    } else {
      request.close();
    }
      
  } else {
    fs.readFile(file, encoding='utf8', fnLoaderHandle);
  }
Problems w/ Apricot
No advanced HTTP client in Node’s lib
npm install request
https + redirects + buffering
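What "redirects + buffering" involves can be sketched without any module. The code below is a minimal sketch, not the `request` module's implementation: `fetchWithRedirects`, `transport`, `maxRedirects`, and the canned responses are all hypothetical, and the transport is injected so the example runs against fake data instead of the network.

```javascript
// transport(url, cb) is a hypothetical function that calls
// cb(err, { statusCode, headers, body }) with the whole body already buffered.
function fetchWithRedirects(url, transport, maxRedirects, callback) {
  transport(url, function (err, response) {
    if (err) return callback(err);
    var status = response.statusCode;
    var location = response.headers.location;
    var isRedirect = [301, 302, 303, 307, 308].indexOf(status) !== -1;
    if (isRedirect && location) {
      if (maxRedirects <= 0) {
        return callback(new Error("too many redirects"));
      }
      // Location may be relative, so resolve it against the current URL.
      var next = new URL(location, url).href;
      return fetchWithRedirects(next, transport, maxRedirects - 1, callback);
    }
    callback(null, { url: url, statusCode: status, body: response.body });
  });
}

// Canned responses standing in for a server (assumption for the demo).
var canned = {
  "http://example.com/old": { statusCode: 301, headers: { location: "/new" }, body: "" },
  "http://example.com/new": { statusCode: 200, headers: {}, body: "<p>hello</p>" }
};

function fakeTransport(url, cb) {
  var response = canned[url];
  if (!response) return cb(new Error("no canned response for " + url));
  cb(null, response);
}

var finalResult = null;
fetchWithRedirects("http://example.com/old", fakeTransport, 5, function (err, result) {
  if (err) throw err;
  finalResult = result;
  console.log(result.url, result.statusCode);
});
```

The redirect limit matters for a crawler: without it, two pages redirecting to each other would keep a connection busy forever.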
Problems w/ Apricot
Apricot.parse("<p id='test'>An HTML Fragment</p>", function(doc) {
doc.find("selector"); // Populates internal collection, See Sizzle selector syntax (rules)
doc.each(callback); // Iterates over the collection, applying a callback to each match (element)
doc.remove(); // Removes all elements in the internal collection (See XUI Syntax)
doc.inner("fragment"); // See XUI Syntax
doc.outer("fragment"); // See XUI Syntax
doc.top("fragment"); // See XUI Syntax
doc.bottom("fragment"); // See XUI Syntax
doc.before("fragment"); // See XUI Syntax
doc.after("fragment"); // See XUI Syntax
doc.hasClass("class"); // See XUI Syntax
doc.addClass("class"); // See XUI Syntax
doc.removeClass("class"); // See XUI Syntax
doc.toHTML; // Returns the HTML
doc.innerHTML; // Returns the innerHTML of the body.
doc.toDOM; // Returns the DOM representation
// Most methods are chainable, so this works
doc.find("selector").addClass('foo').after(", just because");
});
Problems w/ Apricot
XUI api?!
jQuery please :)
require("jsdom").jQueryify !!
var jsdom = require("jsdom"),
    window = jsdom.jsdom().createWindow();
jsdom.jQueryify(window, 'http://code.jquery.com/jquery-1.4.2.min.js' , function() {
    window.$('body').append('<div class="testing">Hello World, It works</div>');
    console.log(window.$('.testing').text());
});
Concurrency?
https://github.com/coopernurse/node-pool
npm install generic-pool
generic-pool
// Create a MySQL connection pool with
// a max of 10 connections and a 30 second max idle time
var poolModule = require('generic-pool');
var pool = poolModule.Pool({
    name : 'mysql',
    create : function(callback) {
        var Client = require('mysql').Client;
        var c = new Client();
        c.user = 'scott';
        c.password = 'tiger';
        c.database = 'mydb';
        c.connect();
        callback(c);
    },
    destroy : function(client) { client.end(); },
    max : 10,
    idleTimeoutMillis : 30000,
    log : false
});
// borrow connection - callback function is called
// once a resource becomes available
pool.borrow(function(client) {
    client.query("select * from foo", [], function() {
        // return object back to pool
        pool.returnToPool(client);
    });
});
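To show what a pool does for a crawler, here is a minimal sketch of the borrow/return idea. It is not generic-pool's implementation: `SimplePool`, `acquire`, and `release` are hypothetical names, chosen so they are not mistaken for the real API, and idle timeouts and resource destruction are left out.

```javascript
// At most `max` resources exist; callers beyond that wait in a FIFO queue.
function SimplePool(options) {
  this.create = options.create; // function () -> new resource
  this.max = options.max;
  this.idle = [];               // resources ready for reuse
  this.waiting = [];            // callbacks waiting for a resource
  this.count = 0;               // resources created so far
}

SimplePool.prototype.acquire = function (callback) {
  if (this.idle.length > 0) {
    callback(this.idle.pop());
  } else if (this.count < this.max) {
    this.count += 1;
    callback(this.create());
  } else {
    this.waiting.push(callback);
  }
};

SimplePool.prototype.release = function (resource) {
  if (this.waiting.length > 0) {
    // Hand the resource straight to the longest-waiting caller.
    this.waiting.shift()(resource);
  } else {
    this.idle.push(resource);
  }
};

// Demo with fake connections (plain objects carrying an id).
var nextId = 0;
var demoPool = new SimplePool({
  max: 2,
  create: function () { nextId += 1; return { id: nextId }; }
});

var served = []; // ids in the order callers received them
var held = [];   // connections currently checked out
for (var i = 0; i < 3; i++) {
  demoPool.acquire(function (conn) {
    served.push(conn.id);
    held.push(conn);
  });
}

// Two callers hold connections; the third is queued until one is released.
demoPool.release(held[0]);
console.log(served);
```

The third caller never triggers a third connection. It simply waits, which is the behaviour a crawler wants when it caps the number of simultaneous requests.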
So what?
Apricot - XUI + jQuery
+ request + generic-pool
+ qunit + ?
=
??
Simple API?
var Crawler = require("node-crawler").Crawler;
var c = new Crawler({
    "maxConnections":10,
    "timeout":60,
    "defaultHandler":function(error,result,$) {
        // jQuery's each() passes (index, element) to its callback
        $("#content a:link").each(function(index,a) {
            c.queue(a.href);
        });
    }
});
c.queue(["http://jamendo.com/","http://tedxparis.com", ...]);
c.queue([{
    "uri":"http://parisjs.org/register",
    "method":"POST",
    "handler":function(error,result,$) {
        $("div:contains(Thank you)").after(" very much");
    }
}]);
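One way an API like this could cap concurrency internally is sketched below. This is a hypothetical illustration, not the code of the module the talk announces: `MiniCrawler`, `fetchPage`, `_start`, and `_next` are assumed names, and the fetch function is injected so the demo can decide when each fake request finishes.

```javascript
function MiniCrawler(options) {
  this.maxConnections = options.maxConnections;
  this.fetchPage = options.fetchPage; // hypothetical: (uri, cb) -> cb(err, body)
  this.handler = options.handler;     // called as handler(err, body, uri)
  this.pending = [];                  // URIs not started yet
  this.active = 0;                    // requests in flight
  this.seen = new Set();              // URIs already queued once
}

// Accepts a single URI or an array of URIs, skipping duplicates.
MiniCrawler.prototype.queue = function (uris) {
  var self = this;
  [].concat(uris).forEach(function (uri) {
    if (!self.seen.has(uri)) {
      self.seen.add(uri);
      self.pending.push(uri);
    }
  });
  this._next();
};

MiniCrawler.prototype._start = function (uri) {
  var self = this;
  this.active += 1;
  this.fetchPage(uri, function (err, body) {
    self.active -= 1;
    self.handler(err, body, uri);
    self._next(); // a slot is free again
  });
};

// Start as many pending requests as the connection limit allows.
MiniCrawler.prototype._next = function () {
  while (this.active < this.maxConnections && this.pending.length > 0) {
    this._start(this.pending.shift());
  }
};

// Fake fetch: remembers each callback so the demo controls completion.
var inFlight = [];
function fakeFetch(uri, cb) {
  inFlight.push({ uri: uri, done: cb });
}

var visited = [];
var crawler = new MiniCrawler({
  maxConnections: 2,
  fetchPage: fakeFetch,
  handler: function (err, body, uri) { visited.push(uri); }
});

crawler.queue(["http://a.test/", "http://b.test/", "http://c.test/"]);
var startedAtOnce = inFlight.length; // limited by maxConnections

// Finishing the first request frees a slot, which starts the third.
inFlight.shift().done(null, "<html></html>");
console.log(startedAtOnce, visited, inFlight.map(function (r) { return r.uri; }));
```

A handler that calls `crawler.queue()` with newly found links would feed the same loop, which is the shape of the `defaultHandler` in the slide.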
Name contest! :)
node-crawler ?
Crawly ?
?????
Thanks!
First code on github tonight
Help & forks welcome
(We’re hiring HTML5/JS hackers ;-)
Also, http://html5weekend.org/
