Multi-threaded web crawler in Ruby
Hi,
I’m Kamil Durski, Senior Ruby Developer at Polcode
If improving Ruby skills is what you’re after, stick around. I’ll
show you how to use multiple threads to drastically increase
the efficiency of your application.
As I focus on threads, only the relevant code will be displayed in the slideshow.
Find the full source here.
The (much) underestimated threads
Ruby programmers have easy access to threads thanks to
built-in support.
Threads can be very useful, yet for some reason they don’t
receive much love.
Where can you use threads to see their prowess first-hand?
Crawling the web is a perfect example! Threads allow you to save
much of the time you’d otherwise spend waiting for data from the remote server.
I’m going to build a simple app so you can really understand
the power of threads. It will fetch info on some popular U.S.
TV shows (that one with dragons and an ex chemistry teacher
too!) from a bunch of websites.
But before we take a look at the code, let’s start with a few
slides of good old theory.
What’s the difference between
a thread and a process?
A multi-threaded app is capable of doing a lot of things at the
same time.
That’s because the app has the ability to switch between
threads, letting each of them use some of the process time.
But it’s still a single process.
The same thing goes for running many apps on a single-core
processor. It’s the operating system that does the switching.
Another big difference
Use threads within a single process and you can share memory
and variables between all of them, making development easier.
Use multiple processes and processor cores and that’s no longer
the case – sharing data gets harder.
Check Wikipedia to find out more on threads.
Now we can go back to the TV shows. Aside from Ruby on Rails’
Active Record library for database access, all I’m going to use
are:
Three components from Ruby’s thread library:
1) Thread – the core class that runs multiple parts of code at
   the same time,
2) Queue – this class will let me schedule jobs to be picked up by all
   the threads,
3) Mutex – the role of the Mutex component is to synchronize
   access to resources. Thanks to that, no other thread can
   enter a protected section of code until the current thread has left it.
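To make the three classes concrete, here is a minimal sketch that uses all of them together. The job names and the `lengths` hash are made up for illustration and are not part of the crawler.

```ruby
# Thread, Queue and Mutex need no require on current Rubies
# (older versions may need `require 'thread'`).
jobs    = Queue.new
lock    = Mutex.new
lengths = {}

# Schedule three illustrative jobs.
%w[alpha beta gamma].each { |name| jobs << name }

threads = Array.new(3) do
  Thread.new do
    name = jobs.pop   # blocking pop is safe here: exactly one job per thread
    lock.synchronize { lengths[name] = name.length }   # one writer at a time
  end
end

threads.each(&:join)   # wait for all threads before reading the result
p lengths
```

Each thread takes one job from the queue, and the mutex makes sure only one thread writes to the shared hash at a time.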
The app itself is also divided into three major components:
1) Module
   I’m going to supply the app with a list of modules to
   run. The module creates multiple threads and tells
   the crawler what to do,
2) Crawler
   I’m going to create crawler classes to fetch data
   from websites,
3) Model
   Models will allow me to store and retrieve data
   from the database.
Crawler module
The Crawler module is responsible
for setting the environment and
connecting to the database.
The autoload calls refer to major
components inside the lib/
directory. The setup_env method
connects to the database,
adds app/ directories to the
$LOAD_PATH variable and loads
all of the files under the app/ directory.
A new instance of the Mutex
class is stored in the
@mutex variable. We can access it
via Crawler.mutex.
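The original slide showed the module's code as an image. What follows is only a rough sketch of the shape described above, not the author's code: the autoload target is a guess and is left commented out, the database connection is omitted, and the `root` parameter is an assumption added for illustration.

```ruby
module Crawler
  # The real project autoloads its components from lib/; the path below
  # is a guess, so it stays commented out.
  # autoload :Threads, 'crawler/threads'

  # One shared mutex for the whole app.
  @mutex = Mutex.new

  class << self
    # Exposes the mutex as Crawler.mutex.
    attr_reader :mutex

    # Simplified stand-in for setup_env: it only handles $LOAD_PATH and
    # file loading. The database connection is left out of this sketch.
    def setup_env(root = Dir.pwd)
      root = File.expand_path(root)
      Dir[File.join(root, 'app', '*')].each do |dir|
        next unless File.directory?(dir)
        $LOAD_PATH.unshift(dir) unless $LOAD_PATH.include?(dir)
      end
      Dir[File.join(root, 'app', '**', '*.rb')].sort.each { |file| require file }
    end
  end
end

# Every caller gets the same Mutex instance.
Crawler.mutex.synchronize { puts 'inside the shared lock' }
```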
Crawler::Threads class
core feature
Now I’m going to create the core
feature of the app. I’m initializing a
few variables - @size, to know how
many threads to spawn, @threads
array to keep track of the threads,
and @queue to store the jobs to do.
I’m calling the #add method to add
each job to the queue. It accepts
optional arguments and a block.
Please, google block in Ruby if
you’re not familiar with the concept.
Next, the #start method initializes
threads and calls #join on each of
them. It’s essential for the whole app
to work – otherwise once the main
thread is done with its job, it would
instantly kill spawned threads and
exit without finishing its job.
To complete the core functionality,
I’m calling the #pop method on the
queue to take a block from it and then run
the block with the arguments from
the earlier #add method. The true
argument makes sure that it runs in
a non-blocking mode. Otherwise, I
would run into a deadlock with the
thread waiting for a new job to be
added even after the queue is already
emptied (eventually throwing
an application error “No live threads
left. Deadlock?”).
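The class described over the last few slides can be sketched roughly as follows. This is a reconstruction from the description, not the code from the repository: the `size:` keyword, its default of 10 and the private `spawn_thread` helper are assumptions. A non-blocking `Queue#pop(true)` raises `ThreadError` on an empty queue, which the sketch uses as the signal to stop.

```ruby
module Crawler
  class Threads
    attr_reader :size, :threads, :queue

    def initialize(size: 10)
      @size    = size        # how many threads to spawn
      @threads = []          # keeps track of the spawned threads
      @queue   = Queue.new   # jobs waiting to be run
    end

    # Stores the block together with its arguments as one job.
    def add(*args, &block)
      queue << [block, args]
    end

    # Spawns the workers, then joins them so the main thread waits.
    def start
      size.times { threads << spawn_thread }
      threads.each(&:join)
    end

    private

    def spawn_thread
      Thread.new do
        loop do
          begin
            block, args = queue.pop(true)   # true = non-blocking
          rescue ThreadError
            break                           # queue is empty, worker is done
          end
          block.call(*args)
        end
      end
    end
  end
end

# Usage with made-up jobs: upcase four strings on three threads.
pool = Crawler::Threads.new(size: 3)
done = Queue.new   # Queue is thread-safe, so it can collect results
%w[a b c d].each { |id| pool.add(id) { |x| done << x.upcase } }
pool.start
upcased = Array.new(done.size) { done.pop }
p upcased.sort
```

Because #start joins every thread, all four jobs have finished by the time the results are read.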
I can use the Crawler::Threads
class to crawl multiple pages at the
same time.
Now I can run some code to see what
all of it amounts to:
10 seconds to visit 10 pages and fetch
some basic information. Alright, now
I’m going to try 10 threads.
All it took to do the same task was 1.51 s!
The app no longer wastes time doing nothing while waiting for
the remote server to deliver data.
Additionally, what’s interesting, the output order is different –
with a single thread it’s the same as in the config file. With
multiple threads it’s random, as some threads do their job
faster.
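The same effect can be reproduced without touching the network. In the sketch below, `sleep` stands in for waiting on a remote server; the 0.1 s delay and the `fetch` lambda are assumptions for illustration, so the timings will not match the figures above.

```ruby
require 'benchmark'

pages = (1..10).to_a
delay = 0.1   # seconds; stands in for the remote server's response time

# Hypothetical fetch: just waits, then returns a label for the page.
fetch = lambda do |page|
  sleep delay
  "page #{page}"
end

# One thread: each page waits for the previous one to finish.
sequential = Benchmark.realtime { pages.each { |page| fetch.call(page) } }

# Ten threads: all the waiting happens at the same time.
threaded = Benchmark.realtime do
  pages.map { |page| Thread.new { fetch.call(page) } }.each(&:join)
end

puts format('1 thread:   %.2f s', sequential)
puts format('10 threads: %.2f s', threaded)
```

The threaded run should take roughly as long as a single request, since the threads spend their time waiting rather than computing.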
Thread safety
The code I used outputs information
using puts. It’s not a thread-safe
way of doing this, as puts does two
separate things:
   - outputs a given string,
   - then outputs the new line (NL)
     character.
This may cause random instances of
NL characters appearing out of place
as the thread switches in the middle
and another assumes control before
the NL character is printed. See the
example below:
I fixed this with a mutex by creating a
custom #log method that wraps the
output to the console
in it:
Now the console output is always
in order as the thread waits for the
puts to finish.
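A minimal sketch of such a mutex-wrapped #log method might look like this. It is not the code from the slides: placing `log` directly on the `Crawler` module and the optional `io` parameter are assumptions, the latter added so the output can be captured in the example.

```ruby
require 'stringio'

module Crawler
  @mutex = Mutex.new

  def self.mutex
    @mutex
  end

  # Only one thread at a time may write, so a message and its
  # newline always stay together.
  def self.log(message, io = $stdout)
    mutex.synchronize { io.puts(message) }
  end
end

# Five threads log three lines each into an in-memory buffer.
out = StringIO.new
threads = Array.new(5) do |i|
  Thread.new { 3.times { Crawler.log("thread #{i} done", out) } }
end
threads.each(&:join)

lines = out.string.lines
puts lines.size
```

Every captured line is a complete message followed by a single newline, whatever order the threads ran in.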
And that’s it.
Now you know more about how threads work.
I wrote this code as a side project, the topic of web crawling
being an important part of what I do. The previous version
included more features such as the usage of proxies and TOR
network support. The latter improves anonymity but also slows
down the code a lot.
Thanks for your time and, again, feel free to tackle the entire
code at:
https://github.com/kdurski/crawler

What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
WSO2CON 2024 - API Management Usage at La Poste and Its Impact on Business an...
 
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital TransformationWSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
WSO2Con2024 - WSO2's IAM Vision: Identity-Led Digital Transformation
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
%+27788225528 love spells in Atlanta Psychic Readings, Attraction spells,Brin...
 
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open SourceWSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
WSO2CON 2024 - Freedom First—Unleashing Developer Potential with Open Source
 
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?WSO2CON 2024 - Does Open Source Still Matter?
WSO2CON 2024 - Does Open Source Still Matter?
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 

Multi-threaded web crawler in Ruby

  • 2. Hi, I’m Kamil Durski, Senior Ruby Developer at Polcode If improving Ruby skills is what you’re after, stick around. I’ll show you how to use multiple threads to drastically increase the efficiency of your application. As I focus on threads, only the relevant code will be displayed in the slideshow. Find the full source here.
  • 4. Ruby programmers have easy access to threads thanks to built-in support. Threads can be very useful, yet for some reason they don’t receive much love. Where can you use threads to see their prowess first-hand? Crawling the web is a perfect example! Threads let you save much of the time you’d otherwise spend waiting for data from the remote server.
  • 5. I’m going to build a simple app so you can really understand the power of threads. It will fetch info on some popular U.S. TV shows (that one with dragons and an ex-chemistry teacher too!) from a bunch of websites. But before we take a look at the code, let’s start with a few slides of good old theory.
  • 6. What’s the difference between a thread and a process?
  • 7. A multi-threaded app is capable of doing a lot of things at the same time. That’s because the app has the ability to switch between threads, letting each of them use some of the process time. But it’s still a single process. The same thing goes for running many apps on a single-core processor, except that there it’s the operating system that does the switching.
  • 8. Another big difference: use threads within a single process and you can share memory and variables between all of them, making development easier. Use multiple processes and processor cores and it’s no longer the case: sharing data gets harder. Check Wikipedia to find out more on threads.
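The memory-sharing difference can be seen in a few lines. This is only a minimal sketch, assuming a platform where Ruby supports fork (most Unix-like systems); the counter variable is made up for illustration and is not part of the crawler.

```ruby
counter = 0

# A thread runs inside the same process, so it sees and changes `counter`.
Thread.new { counter += 1 }.join
after_thread = counter

# A forked child process works on its own copy of memory,
# so its change never reaches the parent process.
if Process.respond_to?(:fork)
  pid = fork { counter += 100 }
  Process.wait(pid)
end
after_fork = counter

puts "after thread: #{after_thread}"
puts "after fork:   #{after_fork}"
```

Both lines print 1: the thread's increment is visible to the main program, the child process's is not.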
  • 9. Now we can go back to the TV shows. Aside from Ruby on Rails’ Active Record library for database access, all I’m going to use are three components from Ruby’s thread library: 1) Thread: the core class that runs multiple parts of code at the same time, 2) Queue: this class will let me schedule jobs to be used by all the threads, 3) Mutex: its role is to synchronize access to shared resources. Thanks to that, the app won’t switch to another thread too early.
  • 10. The app itself is also divided into three major components: 1) Module: I’m going to supply the app with a list of modules to run. The module creates multiple threads and tells the crawler what to do, 2) Crawler: I’m going to create crawler classes to fetch data from websites, 3) Model: models will allow me to store and retrieve data from the database.
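To make the three classes concrete, here is a small sketch of how they might work together. It is not taken from the crawler's source: the jobs are plain integers and the variable names are assumptions chosen for illustration.

```ruby
# Thread, Queue and Mutex ship with Ruby itself;
# very old Ruby versions needed `require 'thread'` for Queue.

queue   = Queue.new   # jobs shared by all worker threads
mutex   = Mutex.new   # guards the shared results array
results = []

# Schedule five hypothetical jobs (plain integers here).
5.times { |i| queue << i }

workers = Array.new(3) do
  Thread.new do
    loop do
      job = begin
        queue.pop(true)   # non-blocking pop, raises ThreadError when empty
      rescue ThreadError
        break             # nothing left to do, let this thread finish
      end
      squared = job * job
      mutex.synchronize { results << squared }
    end
  end
end

workers.each(&:join)      # wait for every worker before reading results
sorted_results = results.sort
puts sorted_results.inspect
```

The results are sorted before printing because the order in which the threads finish their jobs is not guaranteed.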
  • 12. The Crawler module is responsible for setting up the environment and connecting to the database.
  • 13. The autoload calls refer to major components inside the lib/ directory. The setup_env method connects to the database, adds the app/ directories to the $LOAD_PATH variable and requires all of the files under the app/ directory. A new instance of the Mutex class is stored inside the @mutex variable. We can access it via Crawler.mutex.
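The slide with the actual module is not reproduced in this transcript, so the following is only a rough sketch of what such a module could look like. The autoloaded constants (Base, Model), the file paths and the root parameter are assumptions, and the database connection is left out entirely.

```ruby
module Crawler
  # Hypothetical components under lib/; autoload only registers the path,
  # the file is required the first time the constant is referenced.
  autoload :Base,  File.join(Dir.pwd, 'lib', 'crawler', 'base')
  autoload :Model, File.join(Dir.pwd, 'lib', 'crawler', 'model')

  # One mutex shared by the whole app, reachable as Crawler.mutex.
  @mutex = Mutex.new

  class << self
    attr_reader :mutex

    # Adds each directory under app/ to $LOAD_PATH and requires every Ruby
    # file below app/. The database connection step is omitted in this sketch.
    def setup_env(root = Dir.pwd)
      app_dirs = Dir[File.join(root, 'app', '*')].select { |path| File.directory?(path) }
      app_dirs.each { |dir| $LOAD_PATH.unshift(dir) unless $LOAD_PATH.include?(dir) }
      Dir[File.join(root, 'app', '**', '*.rb')].sort.each { |file| require file }
    end
  end
end

puts Crawler.mutex.locked?
```

In a real app, Crawler.setup_env would be called once at boot, before any threads are started.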
  • 15. Now I’m going to create the core feature of the app. I’m initializing a few variables: @size, to know how many threads to spawn, the @threads array to keep track of the threads, and @queue to store the jobs to do.
  • 16. I’m calling the #add method to add each job to the queue. It accepts optional arguments and a block. If you’re not familiar with blocks in Ruby, please look the concept up first.
  • 17. Next, the #start method initializes the threads and calls #join on each of them. It’s essential for the whole app to work. Otherwise, once the main thread is done with its job, it would instantly kill the spawned threads and exit before they finish their jobs.
  • 18. To complete the core functionality, I’m calling the #pop method on the queue to take a block and then running that block with the arguments from the earlier #add method. The true argument makes sure that #pop runs in non-blocking mode. Otherwise, I would run into a deadlock, with the thread waiting for a new job to be added even after the queue has already been emptied (eventually throwing the error “No live threads left. Deadlock?”).
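Putting the last four slides together, a thread pool along these lines could be sketched as follows. This is a reconstruction from the description above, not the author's exact code: the keyword argument, the private #work helper and the demo jobs are assumptions.

```ruby
module Crawler
  class Threads
    attr_reader :size, :queue, :threads

    def initialize(size: 10)
      @size    = size       # how many threads to spawn
      @threads = []         # keeps track of the spawned threads
      @queue   = Queue.new  # stores the jobs to do
    end

    # Schedule a job: the block plus the arguments it should receive later.
    def add(*args, &block)
      queue << [block, args]
    end

    # Spawn the workers, then wait until every one of them has finished.
    # Without #join the main thread could exit and kill them early.
    def start
      size.times { threads << Thread.new { work } }
      threads.each(&:join)
    end

    private

    # Take jobs until the queue is empty. pop(true) is non-blocking and
    # raises ThreadError on an empty queue, which ends this worker.
    # Caveat: a ThreadError raised inside a job would end it the same way.
    def work
      loop do
        block, args = queue.pop(true)
        block.call(*args)
      end
    rescue ThreadError
      # Queue drained, nothing more to do.
    end
  end
end

# Usage with hypothetical jobs that only upcase a name.
pool    = Crawler::Threads.new(size: 3)
lock    = Mutex.new
visited = []
%w[a b c d e].each do |name|
  pool.add(name) { |n| lock.synchronize { visited << n.upcase } }
end
pool.start
puts visited.sort.inspect
```

All jobs are added before #start is called. With a non-blocking #pop, a worker that finds the queue empty simply finishes, so jobs added later would need a fresh call to #start.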
  • 19. I can use the Crawler::Threads class to crawl multiple pages at the same time.
  • 21. 10 seconds to visit 10 pages and fetch some basic information. Alright, now I’m going to try 10 threads.
  • 22. All it took to do the same task was 1.51 s! The app no longer wastes time doing nothing while waiting for the remote server to deliver data. Additionally, what’s interesting, the output order is different: for the single-thread option it’s the same as in the config file. For the multi-threaded one it’s random, as some threads do their job faster.
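The effect is easy to reproduce without touching the network. In the sketch below, a sleep of 0.1 s stands in for the time spent waiting on a remote server; the fetch method and the page names are made up for illustration, and the timings will differ from the figures quoted above.

```ruby
require 'benchmark'

PAGES = (1..10).map { |i| "page-#{i}" }  # hypothetical page names

# Stand-in for a real HTTP request: the thread just waits, as it would
# while the remote server prepares its response.
def fetch(page)
  sleep 0.1
  "#{page} fetched"
end

# One thread: every wait happens one after another.
sequential = Benchmark.realtime { PAGES.each { |page| fetch(page) } }

# Ten threads: the waits overlap, so the total is close to a single wait.
threaded = Benchmark.realtime do
  PAGES.map { |page| Thread.new { fetch(page) } }.each(&:join)
end

puts format('1 thread:   %.2f s', sequential)
puts format('10 threads: %.2f s', threaded)
```

The gain comes from overlapping the waiting time, which is why I/O-heavy work such as crawling benefits so much from threads.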
  • 24. The code I used outputs information using puts. This is not thread-safe, because puts does two separate things: it outputs the given string, then it outputs the newline (NL) character. As a result, NL characters may randomly appear out of place when the app switches threads in the middle and another thread takes control before the NL character is printed.
  • 25. I fixed this with a mutex by creating a custom #log method that wraps the console output in it. Now the console output is always in order, as each thread waits for the puts to finish.
  • 26. And that’s it. Now you know more about how threads work. I wrote this code as a side project, as web crawling is an important part of what I do. The previous version included more features, such as the usage of proxies and TOR network support. The latter improves anonymity but also slows down the code a lot. Thanks for your time and, again, feel free to tackle the entire code at: https://github.com/kdurski/crawler
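A mutex-wrapped logger can be as small as the sketch below. The method name #log comes from the slide; the optional io parameter and the StringIO buffer are assumptions added here so the result can be inspected.

```ruby
require 'stringio'

module Crawler
  @mutex = Mutex.new

  class << self
    attr_reader :mutex

    # Only one thread at a time may write, so the string and its
    # newline character always end up next to each other.
    def log(message, io = $stdout)
      mutex.synchronize { io.puts(message) }
    end
  end
end

# Five threads logging into one shared buffer.
buffer  = StringIO.new
threads = Array.new(5) do |i|
  Thread.new { Crawler.log("thread #{i} done", buffer) }
end
threads.each(&:join)

puts buffer.string
```

The lines can still appear in any order, since the threads finish at different times, but each line arrives complete.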