“Web Crawler”

Ranjit R. Banshpal
Overview
OBJECTIVE
INTRODUCTION
PROBLEM STATEMENT
ARCHITECTURE OF WEB CRAWLER
APPROACHES FOR CRAWLING PROCESS
POLICIES USED
UTILITIES OF WEB CRAWLER
CONCLUSION
SCOPE FOR FUTURE
REFERENCES

Objective
• Internet users and accessible web pages.
• Hypertext system.
• Web crawlers are among the most crucial components of search engines; optimizing them would greatly improve search efficiency.

Introduction
• Programs that exploit the graph structure of the web to move from page to page.
• Programs that browse the World Wide Web in a methodical, automated manner.
• In search engines, crawlers:
  • are among the most crucial components;
  • improve search efficiency.
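
As a rough illustration of "moving from page to page", the sketch below extracts links from HTML and follows them breadth-first. It is only a minimal sketch: the `PAGES` dictionary is a hypothetical in-memory stand-in for the web, and the names `LinkExtractor` and `crawl` are assumptions made for this example, not part of the original slides.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


# Hypothetical in-memory "web": URL -> HTML body (no network access).
PAGES = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}


def crawl(seed, fetch=PAGES.get):
    """Visit pages breadth-first from the seed, each URL at most once."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # unknown page: nothing to parse
            continue
        order.append(url)
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

A real crawler would replace the `fetch` argument with an HTTP client; the traversal logic stays the same.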

Literature survey
Literature survey paper 1
“Distributed Ontology-Driven Focused Crawling”
•Vertical search technologies.
•Focused crawling.
•Ontological structure.
The web crawler architecture uses URL scoring functions, a scheduler, a DOM parser, and a page ranker to download web pages.
Literature survey paper 2
“Efficient Focused Crawling based on Best First Search”
• Seeks out pages that are relevant to given keywords.
• A focused crawler analyzes the links that are likely to be most relevant.
• A crawler that follows a best-first search strategy is identified as a “focused crawler”.
A focused crawler has two main components:
(i) Finding specific web pages.
(ii) Proceeding from the seed pages.
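
The best-first idea can be sketched with a priority queue: the most promising unvisited link is always expanded next. This is an illustrative sketch only, not the method of the cited paper. The toy `WEB` graph, the keyword-count `relevance` function, and the rule "link score = parent page relevance + anchor text relevance" are all assumptions made for the example.

```python
import heapq

# Hypothetical toy web: URL -> (page text, [(linked URL, anchor text), ...])
WEB = {
    "seed": ("portal page", [("sports", "football scores"),
                             ("intro", "web crawler basics"),
                             ("news", "daily news")]),
    "intro": ("a focused crawler is a crawler that follows relevant links",
              [("deep", "ontology methods")]),
    "sports": ("football", [("news", "more news")]),
    "news": ("headlines", []),
    "deep": ("ontology driven crawler", []),
}


def relevance(text, keywords):
    """Assumed scoring: number of exact keyword occurrences in the text."""
    words = text.lower().split()
    return sum(words.count(k) for k in keywords)


def best_first_crawl(seed, keywords, limit):
    """Visit up to `limit` pages, always taking the highest-scored link."""
    frontier = [(0, seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        text, links = WEB[url]
        visited.append(url)
        page_score = relevance(text, keywords)
        for link, anchor in links:
            if link not in seen:
                seen.add(link)
                score = page_score + relevance(anchor, keywords)
                heapq.heappush(frontier, (-score, link))
    return visited
```

With the keyword "crawler", the crawl leaves the seed through the link whose anchor text mentions the topic and stays on topic before touching unrelated pages.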
Literature survey paper 3
“Design of an Ontology based Adaptive Crawler for
Hidden Web”.
• Deep web / invisible web / hidden web.
• Accessing the deep web using ontology.
• Downloading relevant hidden web pages.

Literature survey paper 4
“URL Rule Based Focused Crawlers.”
• Use of URL regular expressions.
• Retrieving topic-specific pages.

To find topic-specific information, the crawler needs to crawl only a small part of the data, and so uses fewer server resources.
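
A URL rule can be expressed as a regular expression that decides, before any download, whether a link is worth fetching. The sketch below is a hedged illustration only: the `ALLOW` and `DENY` patterns and the `example.test` site layout are hypothetical and are not taken from the cited paper.

```python
import re

# Hypothetical rules for a crawl about web crawling on a single site.
ALLOW = [
    re.compile(r"^https?://example\.test/articles/\d{4}/[\w-]*crawl[\w-]*$"),
]
DENY = [
    re.compile(r"\.(jpg|png|css|js)$"),  # static assets
    re.compile(r"[?&]sessionid="),       # session-specific duplicates
]


def should_crawl(url):
    """Return True only if no deny rule matches and some allow rule does."""
    if any(rule.search(url) for rule in DENY):
        return False
    return any(rule.search(url) for rule in ALLOW)
```

Because the decision uses only the URL string, off-topic pages are skipped without ever being requested, which is where the saving in server resources comes from.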

Literature survey paper 5
“A Topic-Specific Web Crawler with Web Page
Hierarchy Based on HTML Dom-Tree.”
• Data is represented in a hierarchical DOM tree.
• The DOM tree is a structural representation of an HTML page.
• Uses the concept of ontology.
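
To make the DOM-tree idea concrete, the sketch below builds a simple tag hierarchy from an HTML string using Python's standard `html.parser`. It is a simplified illustration under stated assumptions: the `Node`, `DomTreeBuilder`, `build_tree`, and `outline` names are invented for this example, and the builder does not apply the full HTML parsing rules that a browser would.

```python
from html.parser import HTMLParser


class Node:
    """One element of the tree: a tag, its parent, children, and text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class DomTreeBuilder(HTMLParser):
    # Elements that never have a closing tag (partial list, assumed).
    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then leave it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tree(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, depth=0):
    """Return the tree as indented tag names, one list entry per element."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

The depth of each element in the outline is the hierarchy a topic-specific crawler could use to weight headings, body text, and links differently.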

Problem statement
• The most prominent challenge for current web crawlers is selecting the important pages to download.
• A crawler cannot download every page on the web.
• It is therefore important for the crawler “to select the pages and to visit ‘important’ pages first by prioritizing the URLs in the queue properly.”
• It must also minimize the load on the crawled websites when parallelizing the crawling process.

Functional diagram of web crawler

Approaches for Crawling process
Broadly, there are two types of crawler:
• Priori: follows a defined path.
• A priori: does not follow a specific path.

Policies Used
• A selection policy that states which pages to download.
• A politeness policy that states how to avoid overloading web sites.
• A parallelization policy that states how to coordinate a distributed web crawl.
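
Two of these policies can be sketched in a few lines. The example below is only one possible reading, under stated assumptions: politeness is modelled as a minimum delay between requests to the same host, and parallelization as a stable hash that assigns each host to one worker. The names `PolitenessGate` and `worker_for` and the delay value are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def worker_for(url, num_workers):
    """Parallelization: map every URL of a host to the same worker index."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class PolitenessGate:
    """Politeness: enforce a minimum delay between fetches per host."""

    def __init__(self, min_delay):
        self.min_delay = min_delay
        self.next_allowed = {}  # host -> earliest time of the next fetch

    def wait_time(self, url, now):
        """Seconds the caller should still wait before fetching this URL."""
        host = urlsplit(url).hostname or ""
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that the URL's host was fetched at time `now`."""
        host = urlsplit(url).hostname or ""
        self.next_allowed[host] = now + self.min_delay
```

The clock value is passed in explicitly (for example from `time.monotonic()`), which keeps the sketch deterministic. Routing all URLs of one host to one worker also means a single `PolitenessGate` per worker is enough to keep that host from being overloaded.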

Utilities of Web Crawler
• Gather pages from the Web.
• Support a search engine.
• Perform data mining.
• Improve web sites (web site analysis).

Conclusion
• The number of extracted documents is reduced: links are analyzed and a great many irrelevant web pages are discarded.
• Crawling time is reduced: once the irrelevant web pages are discarded, the crawling load is lower.

References
Rodrigo Campos, Oscar Rojas, Mauricio Marín, Marcelo Mendoza, “Distributed Ontology-Driven Focused Crawling,” 2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. 1066-6192/12 © 2012 IEEE. DOI 10.1109/PDP.2013.23.

Sunita Rawat, D. R. Patil, “Efficient Focused Crawling based on Best First Search,” 978-1-4673-4529-3/12 © 2012 IEEE.

Manvi, Ashutosh Dixit, Komal Kumar Bhatia, “Design of an Ontology based Adaptive Crawler for Hidden Web,” 978-0-7695-4958-3/13 © 2013 IEEE. DOI 10.1109/CSNT.2013.140.

Xiaolin Zheng, Tao Zhou, Zukun Yu, Deren Chen, “URL Rule Based Focused Crawlers,” IEEE International Conference on e-Business Engineering. 978-0-7695-3395-7/08 © 2008 IEEE. DOI 10.1109/ICEBE.2008.61.

Yuekui Yang, Yajun Du, Yufeng Hai, Zhaoqiong Gao, “A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree,” 2009 Asia-Pacific Conference on Information Processing.

