Supervised by
Dr. Mohamed A. El-Rashidy
Eng. Ahmed Ghozia
Dept. of Computer Science & Engineering,
Faculty of Electronic Engineering,
Menoufiya University
• The main purpose of this project is to build our own search engine, one that meets our needs as a nation.
• The project adds customized features to the search engine, chiefly a time-based search engine meant to deal with local and international news.
• Question: What is a search engine?
• How does a web search engine work?
• Web crawler, indexing, ranking
• Lucene, Nutch, Solr
• Who uses Solr?
• Setting up Nutch for web crawling
• Setting up Solr for search
• Running Nutch in Eclipse for development
• Experiments
• Answer: software that
  • builds an index on text
  • answers queries using that index
• A search engine offers
  • scalability
  • relevance ranking
  • integration of different data sources (email, web pages, files, databases, ...)
• A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Ranking
• A web crawler is a program or automated script that browses the World Wide Web.
• It is used to create a copy of all the visited pages for later processing by a search engine.
• It starts with a list of URLs to visit, called the seeds.
• URLs are visited recursively according to a set of policies:
  • a selection policy
  • a re-visit policy
  • a politeness policy
  • a parallelization policy
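The seed-and-recurse behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `WEB` dictionary is a hypothetical in-memory link graph standing in for real pages, `crawl` and `allowed` are made-up names, and the re-visit, politeness, and parallelization policies are left out entirely. It is not how Nutch implements crawling.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> outgoing links.
WEB = {
    "http://a.example/": ["http://a.example/news", "http://b.example/"],
    "http://a.example/news": ["http://a.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}


def crawl(seeds, max_depth, allowed=lambda url: True):
    """Breadth-first crawl from the seeds, following links up to max_depth hops."""
    frontier = deque((url, 0) for url in seeds)
    seen = set(seeds)
    visited = []
    while frontier:
        url, depth = frontier.popleft()
        visited.append(url)  # a real crawler would fetch and store the page here
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in WEB.get(url, []):
            # Selection policy: skip URLs already queued or rejected by `allowed`.
            if link not in seen and allowed(link):
                seen.add(link)
                frontier.append((link, depth + 1))
    return visited


shallow = crawl(["http://a.example/"], max_depth=1)
deep = crawl(["http://a.example/"], max_depth=2)
same_site = crawl(
    ["http://a.example/"],
    max_depth=2,
    allowed=lambda url: url.startswith("http://a.example/"),
)
```

The `max_depth` argument plays a role similar to a crawl depth limit, and the `allowed` callback is one simple way to express a selection policy such as staying on a single site.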
• Indexing covers how data is collected, parsed, and stored to facilitate fast and accurate search query evaluation.
• The process involves the following steps:
  • Data collection
  • Data traversal
  • Indexing
• Indexing process:
  • Convert the document
  • Extract text and metadata
  • Normalize the text (stop-word removal, stemming)
  • Write the (inverted) index
• Example:
  • Document 1: "Apache Lucene at Jazoon"
  • Document 2: "Jazoon conference"
• Index:
  • apache -> 1
  • conference -> 2
  • jazoon -> 1, 2
  • lucene -> 1
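The example index above can be reproduced with a minimal Python sketch. The stop-word list and the function names `tokenize` and `build_inverted_index` are assumptions made for illustration. Stemming is omitted, and Lucene's own analyzers and index format are considerably more involved.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not Lucene's list).
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "the"}


def tokenize(text):
    """Lowercase the text, split it into terms, and drop stop words."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in terms if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {1: "Apache Lucene at Jazoon", 2: "Jazoon conference"}
index = build_inverted_index(docs)
for term, ids in index.items():
    print(term, "->", ", ".join(str(i) for i in ids))
```

Answering a single-term query is then a dictionary lookup: `index.get("jazoon", [])` yields the IDs of both documents.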
• Ranking: the web search engine responds to a query that a user enters to satisfy his or her information needs, returning the matching documents ordered by relevance.
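One common way to order results by relevance is TF-IDF weighting. The sketch below is a simplified illustration, not the scoring formula used by Lucene or Solr. The function name `rank`, the sample documents, and the exact weighting `tf * log(1 + N / df)` are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; enough for this illustration."""
    return text.lower().split()


def rank(query, docs):
    """Return the IDs of matching documents, best match first."""
    n_docs = len(docs)
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for terms in tokens.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokens.items():
        tf = Counter(terms)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                # Rare terms get a larger weight than common ones.
                score += tf[term] * math.log(1 + n_docs / df[term])
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


docs = {
    1: "apache lucene at jazoon",
    2: "jazoon conference",
    3: "lucene lucene search library",
}
ranked = rank("lucene conference", docs)
```

Here document 3 ranks first because it mentions "lucene" twice, and document 2 ranks above document 1 because "conference" is rarer than "lucene" in this small collection.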
• Lucene is a high-performance, scalable information retrieval (IR) library.
• It lets you add searching capabilities to your applications.
• It is a free, open source project implemented in Java.
• With Lucene, you can index and search email messages, mailing-list archives, instant messenger chats, your wiki pages... the list goes on.
• Nutch is web search engine software.
• Open source web crawler
• Coded entirely in the Java programming language
• Advantages:
  • Scalability
  • Crawler politeness
  • Crawler management
  • Quality
• Solr is an open source enterprise search platform based on the Apache Lucene project.
• Powerful full-text search, hit highlighting, faceted search
• Database integration and rich document (e.g., Word, PDF) handling
• Download a binary package (apache-nutch-bin.zip)
• cd apache-nutch-1.X/
• bin/nutch crawl urls -dir crawl -depth 3 -topN 5
• Now you should be able to see the following directories created:
  • crawl/crawldb
  • crawl/linkdb
  • crawl/segments
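The crawl command reads its seed URLs from the `urls` directory and writes its output under `crawl`. The Python sketch below prepares a seed file and checks for the three output directories listed above. The file name `seed.txt`, the placeholder URL, and the helper names `write_seeds` and `missing_crawl_dirs` are assumptions for illustration. The demo runs in a temporary directory, so it does not touch a real Nutch installation.

```python
import tempfile
from pathlib import Path

# Output directories named in the steps above.
EXPECTED_DIRS = ("crawldb", "linkdb", "segments")


def write_seeds(urls_dir, seeds, filename="seed.txt"):
    """Write one seed URL per line into urls_dir/filename and return the path."""
    directory = Path(urls_dir)
    directory.mkdir(parents=True, exist_ok=True)
    seed_file = directory / filename
    seed_file.write_text("\n".join(seeds) + "\n", encoding="utf-8")
    return seed_file


def missing_crawl_dirs(crawl_dir):
    """Return the expected crawl output directories that do not exist yet."""
    root = Path(crawl_dir)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


# Demo in a throwaway directory (placeholder URL, no crawl has been run).
with tempfile.TemporaryDirectory() as tmp:
    seed_file = write_seeds(Path(tmp) / "urls", ["http://www.example.org/"])
    seed_lines = seed_file.read_text(encoding="utf-8").splitlines()
    missing_before_crawl = missing_crawl_dirs(Path(tmp) / "crawl")
```

After a successful crawl, `missing_crawl_dirs("crawl")` run from the Nutch directory would be expected to return an empty list.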
• If you already have a Solr core set up and wish to index to it, use:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
• Otherwise, see the following steps for how to set up your Solr instance and index your crawl data.
• Download the binary file (apache-Solr-bin.zip)
• cd ${APACHE_SOLR_HOME}/example
• java -jar start.jar
• After you start Solr, you should be able to access the admin console at the following link:
http://localhost:8983/solr/admin/
• Integrate Solr with Nutch:
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/conf/
• Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
• Run the Solr index command:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb -linkdb crawl/linkdb crawl/segments/*
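Once the crawl data is indexed, it can be queried over HTTP. The sketch below builds a request for Solr's `select` handler using only the Python standard library. The base URL matches the one used in the steps above. The field name `content` and the helper names `build_select_url` and `search` are assumptions, so the query should be adjusted to the fields defined in the schema.xml actually deployed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

SOLR_URL = "http://localhost:8983/solr"  # base URL used in the steps above


def build_select_url(query, rows=10, base=SOLR_URL):
    """Build a select URL that asks Solr for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base.rstrip('/')}/select?{params}"


def search(query, rows=10, base=SOLR_URL):
    """Run the query against a live Solr instance and return the matching docs."""
    with urlopen(build_select_url(query, rows, base), timeout=10) as response:
        return json.load(response)["response"]["docs"]


# Building the URL needs no running server; search() does.
url = build_select_url("content:lucene", rows=5)
print(url)
```

With Solr running, `search("content:lucene", rows=5)` would return a list of dictionaries, one per indexed page that matches.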
• Crawling the Egyptian universities
• Crawling the Arabic news websites
Mustafa Mohammed Ahmed Elkhiat
Email: melkhiat@gmail.com
A customized web search engine [autosaved]
