Web pages are served through HTTP and viewed in browsers. Crawlers fetch pages to build indexes. They start from seed URLs and recursively fetch new URLs found on pages. Crawlers face challenges like overlapping delays, avoiding duplicates, handling redirects, and preventing crawler traps. Large-scale crawlers employ techniques like multi-threading, non-blocking sockets, distributed storage, and estimating change rates to efficiently crawl billions of pages across the web.
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Web Crawling Techniques for Search Engines
1. Crawling the Web
Web pages
•Few thousand characters long
•Served through the internet using the hypertext
transport protocol (HTTP)
•Viewed at client end using `browsers’
Crawler
•To fetch the pages to the computer
•At the computer
Automatic programs can analyze hypertext
documents
2. HTML
HyperText Markup Language
Lets the author
• specify layout and typeface
• embed diagrams
• create hyperlinks.
expressedas an anchor tag with a HREF attribute
HREF names another page using a Uniform
Resource Locator (URL),
• URL =
protocol field (“HTTP”) +
a server hostname (“www.cse.iitb.ac.in”) +
file path (/, the `root' of the published file system).
Mining the Web Chakrabarti and Ramakrishnan 2
3. HTTP(hypertext transport
protocol)
Built on top of the Transport Control Protocol
(TCP)
Steps(from client end)
• resolve the server host name to an Internet address
(IP)
Use Domain Name Server (DNS)
DNS is a distributed database of name-to-IP mappings
maintained at a set of known servers
• contact the server using TCP
connect to default HTTP port (80) on the server.
Enter the HTTP requests header (E.g.: GET)
Fetch the response header
– MIME (Multipurpose Internet Mail Extensions)
– A meta-data standard for email and Web content transfer
Mining the Web Chakrabarti and Ramakrishnan 3
Fetch the HTML page
4. Crawl “all” Web pages?
Problem: no catalog of all accessible URLs
on the Web.
Solution:
• start from a given set of URLs
• Progressively fetch and scan them for new
outlinking URLs
• fetch these pages in turn…..
• Submit the text in page to a text indexing
system
• and so on……….
Mining the Web Chakrabarti and Ramakrishnan 4
5. Crawling procedure
Simple
• Great deal of engineering goes into industry-
strength crawlers
• Industry crawlers crawl a substantial fraction
of the Web
• E.g.: Alta Vista, Northern Lights, Inktomi
No guarantee that all accessible Web
pages will be located in this fashion
Crawler may never halt …….
• pages will be added continually even as it is
running.
Mining the Web Chakrabarti and Ramakrishnan 5
6. Crawling overheads
Delays involved in
• Resolving the host name in the URL to an IP
address using DNS
• Connecting a socket to the server and sending
the request
• Receiving the requested page in response
Solution: Overlap the above delays by
• fetching many pages at the same time
Mining the Web Chakrabarti and Ramakrishnan 6
7. Anatomy of a crawler.
Page fetching threads
• Starts with DNS resolution
• Finishes when the entire page has been
fetched
Each page
• stored in compressed form to disk/tape
• scanned for outlinks
Work pool of outlinks
• maintain network utilization without
overloading it
Dealt with by load manager
Continue till he crawler has collected a
Mining the Web Chakrabarti and Ramakrishnan 7
8. Typical anatomy of a large-scale crawler.
Mining the Web Chakrabarti and Ramakrishnan 8
9. Large-scale crawlers: performance
and reliability considerations
Need to fetch many pages at same time
• utilize the network bandwidth
• single page fetch may involve several seconds of
network latency
Highly concurrent and parallelized DNS lookups
Use of asynchronous sockets
• Explicit encoding of the state of a fetch context in a
data structure
• Polling socket to check for completion of network
transfers
•
Multi-processing or multi-threading: Impractical
Care in URL extraction
• Eliminating duplicates to reduce redundant fetches
Mining • Avoiding “spider Chakrabarti”and Ramakrishnan
the Web traps 9
10. DNS caching, pre-fetching and
resolution
A customized DNS component with…..
1. Custom client for address resolution
2. Caching server
3. Prefetching client
Mining the Web Chakrabarti and Ramakrishnan 10
11. Custom client for address resolution
Tailored for concurrent handling of
multiple outstanding requests
Allows issuing of many resolution requests
together
• polling at a later time for completion of
individual requests
Facilitates load distribution among many
DNS servers.
Mining the Web Chakrabarti and Ramakrishnan 11
12. Caching server
With a large cache, persistent across DNS
restarts
Residing largely in memory if possible.
Mining the Web Chakrabarti and Ramakrishnan 12
13. Prefetching client
• Steps
1. Parse a page that has just been fetched
2. extract host names from HREF targets
3. Make DNS resolution requests to the caching
server
• Usually implemented using UDP
• User Datagram Protocol
• connectionless, packet-based communication
protocol
• does not guarantee packet delivery
• Does not wait for resolution to be
completed.
Mining the Web Chakrabarti and Ramakrishnan 13
14. Multiple concurrent fetches
• Managing multiple concurrent
connections
• A single download may take several seconds
• Open many socket connections to different
HTTP servers simultaneously
• Multi-CPU machines not useful
• crawling performance limited by network
and disk
• Two approaches
1. using multi-threading
2. using non-blocking sockets with event
Mining the Web Chakrabarti and Ramakrishnan 14
15. Multi-threading
• logical threads
• physical thread of control provided by the operating
system (E.g.: pthreads) OR
• concurrent processes
• fixed number of threads allocated in advance
• programming paradigm
• create a client socket
• connect the socket to the HTTP service on a server
• Send the HTTP request header
• read the socket (recv) until
• no more characters are available
• close the socket.
• use blocking system calls
Mining the Web Chakrabarti and Ramakrishnan 15
16. Multi-threading: Problems
• performance penalty
• mutual exclusion
• concurrent access to data structures
• slow disk seeks.
• great deal of interleaved, random input-output
on disk
• Due to concurrent modification of document
repository by multiple threads
Mining the Web Chakrabarti and Ramakrishnan 16
17. Non-blocking sockets and event
handlers
• non-blocking sockets
• connect, send or recv call returns immediately
without waiting for the network operation to
complete.
• poll the status of the network operation separately
• “select” system call
• lets application suspend until more data can be read
from or written to the socket
• timing out after a pre-specified deadline
• Monitor polls several sockets at the same time
• More efficient memory management
• code that completes processing not interrupted by
other completions
• No need for locks and semaphores on the pool
Mining the Web Chakrabarti and Ramakrishnan 17
18. Link extraction and normalization
• Goal: Obtaining a canonical form of URL
• URL processing and filtering
• Avoid multiple fetches of pages known by
different URLs
• many IP addresses
• For load balancing on large sites
• Mirrored contents/contents on same file system
• “Proxy pass“
• Mapping of different host names to a single IP address
• need to publish many logical sites
• Relative URLs
• need to be interpreted w.r.t to a base URL.
Mining the Web Chakrabarti and Ramakrishnan 18
19. Canonical URL
Formed by
• Using a standard string for the protocol
• Canonicalizing the host name
• Adding an explicit port number
• Normalizing and cleaning up the path
Mining the Web Chakrabarti and Ramakrishnan 19
20. Robot exclusion
• Check
• whether the server prohibits crawling a
normalized URL
• In robots.txt file in the HTTP root directory of
the server
• species a list of path prefixes which crawlers should
not attempt to fetch.
• Meant for crawlers only
Mining the Web Chakrabarti and Ramakrishnan 20
21. Eliminating already-visited URLs
Checking if a URL has already been fetched
• Before adding a new URL to the work pool
• Needs to be very quick.
• Achieved by computing MD5 hash function on the
URL
Exploiting spatio-temporal locality of access
Two-level hash function.
– most significant bits (say, 24) derived by hashing the host name
plus port
– lower order bits (say, 40) derived by hashing the path
concatenated bits use d as a key in a B-tree
qualifying URLs added to frontier of the crawl.
hash values added to B-tree.
Mining the Web Chakrabarti and Ramakrishnan 21
22. Spider traps
Protecting from crashing on
• Ill-formed HTML
E.g.: page with 68 kB of null characters
• Misleading sites
indefinite number of pages dynamically generated
by CGI scripts
paths of arbitrary depth created using soft
directory links and path remapping features in
HTTP server
Mining the Web Chakrabarti and Ramakrishnan 22
23. Spider Traps: Solutions
No automatic technique can be foolproof
Check for URL length
Guards
• Preparing regular crawl statistics
• Adding dominating sites to guard module
• Disable crawling active content such as CGI
form queries
• Eliminate URLs with non-textual data types
Mining the Web Chakrabarti and Ramakrishnan 23
24. Avoiding repeated expansion of
links on duplicate pages
Reduce redundancy in crawls
Duplicate detection
• Mirrored Web pages and sites
Detecting exact duplicates
• Checking against MD5 digests of stored URLs
• Representing a relative link v(relativetoaliasesu1and
u2)as tuples (h(u1);v) and (h(u2);v)
Detecting near-duplicates
• Even a single altered character will completely change
the digest !
E.g.: date of update/ name and email of the site
administrator
• Solution : Shingling and Ramakrishnan
Mining the Web Chakrabarti 24
25. Load monitor
Keeps track of various system statistics
• Recent performance of the wide area
network (WAN) connection
E.g.: latency and bandwidth estimates.
• Operator-provided/estimated upper bound
on open sockets for a crawler
• Current number of active sockets.
Mining the Web Chakrabarti and Ramakrishnan 25
26. Thread manager
Responsible for
Choosing units of work from frontier
Scheduling issue of network resources
Distribution of these requests over multiple
ISPs if appropriate.
Uses statistics from load monitor
Mining the Web Chakrabarti and Ramakrishnan 26
27. Per-server work queues
Denial of service (DoS) attacks
limit the speed or frequency of responses to
any fixed client IP address
Avoiding DOS
limit the number of active requests to a given
server IP address at any time
maintain a queue of requests for each server
Use the HTTP/1.1 persistent socket capability.
Distribute attention relatively evenly between
a large number of sites
Access locality vs. politeness dilemma
Mining the Web Chakrabarti and Ramakrishnan 27
28. Text repository
Crawler’s last task
Dumping fetched pages into a repository
Decoupling crawler from other functions
for efficiency and reliability preferred
Page-related information stored in two
parts
meta-data
page contents.
Mining the Web Chakrabarti and Ramakrishnan 28
29. Storage of page-related information
Meta-data
relational in nature
usually managed by custom software to avoid
relation database system overheads
text index involves bulk updates
includes fields like content-type, last-modified
date, content-length, HTTP status code, etc.
Mining the Web Chakrabarti and Ramakrishnan 29
30. Page contents storage
Typical HTML Web page compresses to 2-
4 kB (using zlib)
File systems have a 4-8 kB file block size
Too large !!
Page storage managed by custom storage
manager
simple access methods for
crawler to add pages
Subsequent programs (Indexer etc) to retrieve
documents
Mining the Web Chakrabarti and Ramakrishnan 30
31. Page Storage
Small-scale systems
Repository fitting within the disks of a single
machine
Use of storage manager (E.g.: Berkeley DB)
Manage disk-based databases within a single file
configuration as a hash-table/B-tree for URL
access key
To handle ordered access of pages
configuration as a sequential log of page records.
Since Indexer can handle pages in any order
Mining the Web Chakrabarti and Ramakrishnan 31
32. Page Storage
Large Scale systems
Repository distributed over a number of
storage servers
Storage servers
Connected to the crawler through a fast local
network (E.g.: Ethernet)
Hashed by URLs
`T3' grade leased lines.
To handle 10 million pages (40 GB) per hour
Mining the Web Chakrabarti and Ramakrishnan 32
33. Large-scale crawlers often use multiple ISPs and a bank of local storage
servers to store the pages crawled.
Mining the Web Chakrabarti and Ramakrishnan 33
34. Refreshing crawled pages
Search engine's index should be fresh
Web-scale crawler never `completes' its job
High variance of rate of page changes
“If-modified-since” request header with
HTTP protocol
Impractical for a crawler
Solution
At commencement of new crawling round
estimate which pages have changed
Mining the Web Chakrabarti and Ramakrishnan 34
35. Determining page changes
“Expires” HTTP response header
For page that come with an expiry date
Otherwise need to guess if revisiting that
page will yield a modified version.
Score reflecting probability of page being
modified
Crawler fetches URLs in decreasing order of
score.
Assumption : recent past predicts the future
Mining the Web Chakrabarti and Ramakrishnan 35
36. Estimating page change rates
Brewington and Cybenko & Cho
Algorithms for maintaining a crawl in which
most pages are fresher than a specified epoch.
Prerequisite
average interval at which crawler checks for
changes is smaller than the inter-modification
times of a page
Small scale intermediate crawler runs
to monitor fast changing sites
E.g.: current news, weather, etc.
Patched intermediate indices into master
index
Mining the Web Chakrabarti and Ramakrishnan 36
37. Putting together a crawler
Reference implementation of the HTTP client
protocol
World-wide Web Consortium (http://www.w3c.org/
)
w3c-libwww package
Mining the Web Chakrabarti and Ramakrishnan 37
38. Design of the core components:
Crawler class.
To copy bytes from network sockets to storage
media
Three methods to express Crawler's contract
with user
pushing a URL to be fetched to the Crawler
(fetchPush)
Termination callback handler (fetchDone) called with
same URL
Method (start) which starts Crawler's event loop.
Implementation of Crawler class
Need for two helper classes called DNS and Fetch
Mining the Web Chakrabarti and Ramakrishnan 38