Search engine and web crawler
1. Search Engine & Web Crawling
Presented By:
Vinay Arora
Assistant Professor
CSED, Thapar University
Patiala (Punjab)
2. Contents
• What is a search engine
• Example and need of a search engine
• How a search engine works
• Web crawler
• Web crawling
▫ Factors affecting web crawling
- robots.txt
- sitemap.xml
- manual submission of websites to a specific search
engine's database
- amendments to the <a> tag (href and rel attributes)
• Areas related to web crawling
▫ Indexing
▫ Searching algorithms
▫ Data mining and analysis
• Web crawler as Add-On
▫ Downloading a whole website (offline dump)
- Demo tool: HTTrack
• Examples of web crawlers
▫ Open source
3. What is a search engine
• A search engine is a system that collects information about
web pages from across the Internet.
• It indexes that information and stores the result in a huge
database where it can be searched quickly.
• The search engine provides an interface for searching the
database.
• When you enter a keyword into the search engine, it looks
through billions of indexed web pages to help you find the
ones you are looking for.
5. Need of a search engine
• Variety: An Internet search can generate a variety of sources of
information. Results from online encyclopedias, news stories,
university studies, discussion boards, and even personal blogs can
come up in a basic Internet search. This variety lets anyone
searching for information choose the types of sources they would
like to use, or draw on several sources to gain a fuller
understanding of a subject.
• Organization: Internet search engines help to organize the
Internet and individual websites. They collect the vast amount of
information that can be scattered across various places, even on
the same web page, into an organized list that is easier to use.
• Precision: Search engines can provide refined, more precise
results. Searching more precisely lets you cut down on the amount
of information your search generates.
7. How a search engine works
A search engine has three parts.
• Spider: a robot program deployed to track down web pages.
It follows the links these pages contain and adds the
information to the search engine's database.
Example: Googlebot (Google's robot program)
• Index: a database containing a copy of each web page
gathered by the spider.
• Search engine software: technology that enables users to
query the index and that returns results in ranked order.
9. Web crawler
• A Web crawler is a computer program that browses the World
Wide Web in a methodical, automated manner.
• Other names:
▫ Crawler
▫ Spider
▫ Robot (or bot)
▫ Web agent
▫ Wanderer, worm
• Examples: Googlebot, MSNBot, etc.
10. Sequential crawler
• A sequential crawler visits one page at a time
• Seeds can be any list of starting URLs
• The order of page visits is determined by the frontier
data structure
• The stop criterion can be anything (e.g., a limit on the
number of pages crawled); a minimal sketch follows
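Below is a minimal sketch of such a sequential crawler in Python. It
assumes the third-party requests and beautifulsoup4 packages are
installed; the seed URL and page limit are placeholders, not part of
the original slides.

from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seeds, max_pages=100):
    """Sequential crawler: fetch one page at a time from a FIFO frontier."""
    frontier = deque(seeds)   # URLs yet to be fetched
    visited = set(seeds)      # URLs already queued or fetched
    pages = {}                # url -> page text

    while frontier and len(pages) < max_pages:  # stop criterion
        url = frontier.popleft()                # frontier decides visit order
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                            # skip unreachable pages
        pages[url] = resp.text

        # Parse the page and push newly discovered links onto the frontier.
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link, _ = urldefrag(urljoin(url, a["href"]))  # absolute, no fragment
            if link.startswith("http") and link not in visited:
                visited.add(link)
                frontier.append(link)
    return pages

pages = crawl(["https://example.com/"])  # hypothetical seed

Using a FIFO queue as the frontier gives breadth-first visiting order;
swapping in a priority queue would change the crawl order without
touching the rest of the loop.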
12. Architecture of a crawler (Contd.)
• URL Frontier: contains the URLs yet to be fetched in the
current crawl. Initially, a seed set is stored in the URL
frontier, and the crawler begins by taking a URL from it.
• DNS: domain name service resolution; look up the IP address
for a domain name.
• Fetch: generally uses the HTTP protocol to fetch the URL.
• Parse: the page is parsed; text and links are extracted
(as are embedded objects such as images and videos).
13. Architecture of a crawler (Contd.)
• Content Seen?: tests whether a web page with the same
content has already been seen at another URL. This requires
a way to compute a fingerprint of a web page (see the
sketch below).
• URL Filter:
▫ Decides whether an extracted URL should be excluded from
the frontier (e.g., because robots.txt disallows it).
▫ URLs should be normalized.
• Duplicate URL Elimination: the URL is checked against the
URLs already seen, and duplicates are eliminated.
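A minimal sketch of content fingerprinting, URL normalization, and
duplicate elimination, using only Python's standard library. A hash of
the raw text only catches exact duplicates; real crawlers use more
robust fingerprints (e.g., shingling or simhash) for near-duplicates.

import hashlib
from urllib.parse import urlsplit, urlunsplit

def fingerprint(page_text):
    """Exact-duplicate fingerprint: a hash of the page content."""
    return hashlib.sha1(page_text.encode("utf-8")).hexdigest()

def normalize(url):
    """Normalize a URL: lowercase scheme/host, drop fragment, default path."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

seen_content = set()  # fingerprints of pages already seen
seen_urls = set()     # normalized URLs already queued or fetched

def is_new(url, page_text):
    """True only if neither this URL nor this content was seen before."""
    fp = fingerprint(page_text)
    if fp in seen_content or normalize(url) in seen_urls:
        return False
    seen_content.add(fp)
    seen_urls.add(normalize(url))
    return True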
14. Web crawling & factors affecting it
• Crawling (spidering): finding and downloading web pages
automatically.
• Factors are the mechanisms that direct or restrict a
crawler while it performs the crawl:
▫ robots.txt
▫ sitemap.xml
▫ manual submission of websites to a specific search
engine's database
▫ amendments to the <a> tag (href and rel attributes)
15. robots.txt
• The robots exclusion standard, also known as the robots
exclusion protocol or robots.txt protocol, is a standard used
by websites to communicate with web crawlers and other web
robots.
• The standard specifies the instruction format to be used to
inform the robot about which areas of the website should not
be processed or scanned.
• Robots are often used by search engines to categorize and
archive web sites, or by webmasters to proofread source code.
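As an illustration, here is a minimal sketch of checking a robots.txt
file before crawling, using Python's standard urllib.robotparser. The
file contents and the bot name are hypothetical.

from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: /private/ is off limits to all robots.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyBot", "https://example.com/index.html"))  # True
print(rp.can_fetch("MyBot", "https://example.com/private/x"))   # False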
17. sitemap.xml
• The Sitemaps protocol allows a webmaster to inform search
engines about URLs on a website that are available for
crawling.
• A Sitemap is an XML file that lists the URLs for a site.
• It allows webmasters to include additional information about
each URL: when it was last updated, how often it changes, and
how important it is in relation to other URLs in the site.
• This allows search engines to crawl the site more intelligently.
• Sitemaps are a URL inclusion protocol and complement
robots.txt, a URL exclusion protocol.
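A minimal sketch of reading such a sitemap with Python's standard
xml.etree.ElementTree. The URL and field values are hypothetical; the
namespace is the standard one defined by sitemaps.org.

import xml.etree.ElementTree as ET

# A hypothetical minimal sitemap.xml with a single URL entry.
sitemap_xml = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2015-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
"""

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
for url in root.findall("sm:url", ns):
    print(url.find("sm:loc", ns).text,      # page address
          url.find("sm:lastmod", ns).text)  # last modification date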
20. Amendments to the <a> tag (href and rel
attributes)
• The <a> tag defines a hyperlink, used to link from one page
to another; its href attribute holds the target URL.
• An ordinary link, which crawlers will follow:
<a href="http://www.w3schools.com">Visit W3Schools.com!</a>
• Adding rel="nofollow" signals crawlers not to follow the link:
<a rel="nofollow" href="http://www.w3schools.com">Visit
W3Schools.com!</a>
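A minimal sketch of how a crawler might honor rel="nofollow" while
extracting links, using Python's standard html.parser; the sample HTML
and the spam URL are hypothetical.

from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags, skipping rel="nofollow" links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if "nofollow" in (attrs.get("rel") or "").split():
            return  # the crawler should not follow this link
        if "href" in attrs:
            self.links.append(attrs["href"])

parser = LinkExtractor()
parser.feed('<a href="http://www.w3schools.com">Visit</a> '
            '<a rel="nofollow" href="http://spam.example">Skip me</a>')
print(parser.links)  # ['http://www.w3schools.com']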
21. Areas related to web crawling -
Indexing
• Search engine indexing collects, parses, and stores data to
facilitate fast and accurate information retrieval.
• The purpose of storing an index is to optimize speed and
performance in finding relevant documents for a search
query.
• Without an index, the search engine would scan every
document in the corpus, which would require considerable
time and computing power.
22. Areas related to web crawling –
Indexing (Contd.)
• Search engine architectures vary in the way indexing is
performed and in methods of index storage, to meet various
design factors.
• Index data structures:
▫ Suffix tree
▫ Inverted index (sketched below)
▫ Citation index
▫ N-gram index
▫ Document-term matrix
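A minimal sketch of an inverted index, the structure most search
engines build on, in plain Python; the two documents are hypothetical.

from collections import defaultdict

docs = {  # doc id -> text (hypothetical corpus)
    1: "web crawlers feed the search engine index",
    2: "the index maps terms to documents",
}

# Build: map each term to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(query):
    """AND query: intersect the posting sets of all query terms."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("index"))         # {1, 2}
print(search("search index"))  # {1}

Because the postings are precomputed at index time, a query touches
only the sets for its own terms instead of scanning every document,
which is exactly the speed benefit described on the previous slide.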
23. Areas related to web crawling -
Searching algorithms
• String matching algorithms (a KMP sketch follows this list):
▫ Brute-force algorithm
▫ Rabin-Karp algorithm
▫ Knuth-Morris-Pratt algorithm
▫ Boyer-Moore algorithm
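For instance, a minimal sketch of the Knuth-Morris-Pratt algorithm in
Python; it finds all occurrences of a pattern in O(n + m) time by never
re-examining text characters.

def kmp_search(text, pattern):
    """Knuth-Morris-Pratt: return start indices of all matches."""
    if not pattern:
        return []
    # Failure function: fail[i] = length of the longest proper prefix
    # of pattern[:i+1] that is also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan the text, reusing the failure function on mismatches.
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)  # start index of a match
            k = fail[k - 1]
    return hits

print(kmp_search("abracadabra", "abra"))  # [0, 7]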
24. Areas related to web crawling - Data
mining and analysis
• Graph Mining
▫ Apriori-based Approach
▫ Pattern-Growth Approach
▫ Pattern-growth-based frequent substructure mining
25. Web crawler as Add-On
• Downloading a whole website (offline dump) – HTTrack
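For example, a typical HTTrack command line (the URL and output
directory are placeholders) mirrors a site into a local folder:

httrack "http://example.com/" -O ./offline-dump

HTTrack downloads the pages and rewrites their links so the local
copy can be browsed offline.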