Crawling the Web
Web pages
 • Few thousand characters long
 • Served through the Internet using the Hypertext
   Transfer Protocol (HTTP)
 • Viewed at the client end using “browsers”
Crawler
 • Fetches the pages to the computer
 • At the computer, automatic programs can analyze
   the hypertext documents
HTML
 • HyperText Markup Language
 • Lets the author
      • specify layout and typeface
      • embed diagrams
      • create hyperlinks
          – expressed as an anchor tag with an HREF attribute
          – HREF names another page using a Uniform
            Resource Locator (URL)
      • URL =
          – protocol field (“HTTP”) +
          – a server host name (“www.cse.iitb.ac.in”) +
          – file path (/, the “root” of the published file system)
Mining the Web              Chakrabarti and Ramakrishnan                2
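The three URL fields listed above can be pulled apart with Python's standard library. This is a minimal sketch; the helper name url_parts is an assumption made for illustration and is not part of the slides.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Return (protocol, host name, path) for an absolute URL."""
    parts = urlsplit(url)
    # An empty path is treated as "/", the root of the published file system.
    return parts.scheme, parts.hostname, parts.path or "/"


print(url_parts("http://www.cse.iitb.ac.in/"))
```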
HTTP (Hypertext Transfer Protocol)
 • Built on top of the Transmission Control Protocol
   (TCP)
 • Steps (from the client end)
      • resolve the server host name to an Internet address
        (IP)
          – use the Domain Name System (DNS)
          – DNS is a distributed database of name-to-IP mappings
            maintained at a set of known servers
      • contact the server using TCP
          – connect to the default HTTP port (80) on the server
          – send the HTTP request header (e.g., GET)
          – fetch the response header
              • MIME (Multipurpose Internet Mail Extensions)
              • a meta-data standard for email and Web content transfer
          – fetch the HTML page
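The client-side steps above can be sketched with plain sockets. This is an illustrative sketch, not the crawler described in the slides: the function names are assumptions, HTTP/1.0 is used so that the server closes the connection when the page ends, and redirects and error handling are ignored.

```python
import socket


def build_get_request(host, path="/"):
    """Build a minimal HTTP/1.0 GET request header as bytes."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}", "", ""]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw):
    """Split a raw response into (status code, header dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(status_line.split()[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, body


def fetch(host, path="/", port=80, timeout=10):
    """Resolve the host, connect over TCP, send GET, read until close."""
    # create_connection performs the DNS lookup and the TCP connect.
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return parse_response(b"".join(chunks))
```

Calling fetch("www.cse.iitb.ac.in") would return the status code, the response headers (including the MIME content-type) and the page bytes, assuming the host is reachable.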
Crawl “all” Web pages?
 • Problem: no catalog of all accessible URLs
   on the Web
 • Solution:
      • start from a given set of URLs
      • progressively fetch and scan them for new
        outlinking URLs
      • fetch these pages in turn
      • submit the text in each page to a text indexing
        system
      • and so on
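The fetch-scan-index loop above can be sketched as follows. The callables fetch_page and index_page and the max_pages cap are assumptions introduced so the sketch can run against an in-memory toy Web instead of the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the HREF attributes of anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))


def crawl(seeds, fetch_page, index_page, max_pages=100):
    """Breadth-first crawl from the seed URLs; returns URLs in fetch order."""
    frontier = deque(seeds)
    seen = set(seeds)
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        html = fetch_page(url)
        if html is None:  # page could not be fetched
            continue
        fetched.append(url)
        index_page(url, html)  # hand the text to the indexing system
        extractor = LinkExtractor(url)
        extractor.feed(html)
        for link in extractor.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


# Toy Web: a dict stands in for the network (hypothetical pages).
web = {
    "http://a.example/": '<a href="/b">b</a> <a href="http://c.example/">c</a>',
    "http://a.example/b": '<a href="/">home</a>',
    "http://c.example/": "no links here",
}
index = {}
order = crawl(["http://a.example/"], web.get, index.__setitem__)
print(order)
```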
Crawling procedure
 • Simple
      • Great deal of engineering goes into
        industry-strength crawlers
      • Industry crawlers crawl a substantial fraction
        of the Web
      • E.g.: Alta Vista, Northern Light, Inktomi
 • No guarantee that all accessible Web
   pages will be located in this fashion
 • Crawler may never halt
      • pages will be added continually even as it is
        running
Crawling overheads
 • Delays involved in
      • resolving the host name in the URL to an IP
        address using DNS
      • connecting a socket to the server and sending
        the request
      • receiving the requested page in response
 • Solution: overlap the above delays by
      • fetching many pages at the same time
Anatomy of a crawler
 • Page fetching threads
      • start with DNS resolution
      • finish when the entire page has been
        fetched
 • Each page
      • stored in compressed form to disk/tape
      • scanned for outlinks
 • Work pool of outlinks
      • maintain network utilization without
        overloading it
          – dealt with by the load manager
 • Continue till the crawler has collected a
Typical anatomy of a large-scale crawler.
Large-scale crawlers: performance
    and reliability considerations
 • Need to fetch many pages at the same time
      • to utilize the network bandwidth
      • a single page fetch may involve several seconds of
        network latency
 • Highly concurrent and parallelized DNS lookups
 • Use of asynchronous sockets
      • explicit encoding of the state of a fetch context in a
        data structure
      • polling sockets to check for completion of network
        transfers
      • multi-processing or multi-threading: impractical
 • Care in URL extraction
      • eliminating duplicates to reduce redundant fetches
      • avoiding “spider traps”
DNS caching, pre-fetching and resolution
 • A customized DNS component with
      1. Custom client for address resolution
      2. Caching server
      3. Prefetching client
Custom client for address resolution
 • Tailored for concurrent handling of
   multiple outstanding requests
 • Allows issuing of many resolution requests
   together
      • polling at a later time for completion of
        individual requests
 • Facilitates load distribution among many
   DNS servers
Caching server
 • With a large cache, persistent across DNS
   restarts
 • Residing largely in memory if possible
Prefetching client
 • Steps
      1. Parse a page that has just been fetched
      2. Extract host names from HREF targets
      3. Make DNS resolution requests to the caching
         server
 • Usually implemented using UDP
      • User Datagram Protocol
      • connectionless, packet-based communication
        protocol
      • does not guarantee packet delivery
 • Does not wait for resolution to be
   completed
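The three prefetching steps, combined with a caching layer, might look like the sketch below. It is a simplification under stated assumptions: the class and function names are hypothetical, the resolver is injected as a plain callable (for example socket.gethostbyname), and lookups here are synchronous, whereas the prefetching client described above fires UDP requests and does not wait for answers.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class HostExtractor(HTMLParser):
    """Collect the host names that appear in HREF targets."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                host = urlsplit(value).hostname
                if host:  # relative links carry no host name
                    self.hosts.add(host)


class CachingResolver:
    """Remember name-to-IP answers so each host is resolved only once."""

    def __init__(self, resolve):
        self._resolve = resolve  # assumed callable: host name -> IP string
        self._cache = {}
        self.lookups = 0  # number of real resolutions performed

    def lookup(self, host):
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolve(host)
        return self._cache[host]


def prefetch(html, resolver):
    """Warm the cache with every host named in a freshly fetched page."""
    extractor = HostExtractor()
    extractor.feed(html)
    for host in sorted(extractor.hosts):
        resolver.lookup(host)
    return extractor.hosts
```

When the fetcher later asks for one of these hosts, the answer is already in the cache and no further resolution is needed.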
Multiple concurrent fetches
 • Managing multiple concurrent
   connections
      • a single download may take several seconds
      • open many socket connections to different
        HTTP servers simultaneously
 • Multi-CPU machines not useful
      • crawling performance limited by network
        and disk
 • Two approaches
      1. using multi-threading
      2. using non-blocking sockets with event
         handlers
Multi-threading
 • Logical threads
      • physical threads of control provided by the operating
        system (e.g., pthreads), or
      • concurrent processes
 • Fixed number of threads allocated in advance
 • Programming paradigm
      • create a client socket
      • connect the socket to the HTTP service on a server
      • send the HTTP request header
      • read the socket (recv) until
        no more characters are available
      • close the socket
 • Uses blocking system calls
Multi-threading: Problems
 • Performance penalty
      • mutual exclusion
      • concurrent access to data structures
 • Slow disk seeks
      • great deal of interleaved, random input-output
        on disk
      • due to concurrent modification of the document
        repository by multiple threads
Non-blocking sockets and event handlers
 • Non-blocking sockets
      • connect, send or recv call returns immediately
        without waiting for the network operation to
        complete
      • poll the status of the network operation separately
 • “select” system call
      • lets the application suspend until more data can be read
        from or written to the socket
      • times out after a pre-specified deadline
      • can monitor several sockets at the same time
 • More efficient memory management
      • code that completes processing is not interrupted by
        other completions
      • no need for locks and semaphores on the pool
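A single-threaded event loop in the spirit of the select-based approach can be sketched with Python's selectors module, which wraps select and its platform equivalents. The function name read_all and the dictionary of named sockets are assumptions for illustration; a real crawler would also drive the connect and send phases through the same loop.

```python
import selectors
import socket


def read_all(socks, timeout=1.0):
    """Read from several sockets in one thread until each peer closes.

    socks maps a name to a connected socket; returns name -> bytes read.
    """
    sel = selectors.DefaultSelector()
    buffers = {}
    for name, sock in socks.items():
        sock.setblocking(False)  # recv now returns immediately
        sel.register(sock, selectors.EVENT_READ, data=name)
        buffers[name] = bytearray()
    while sel.get_map():  # some sockets are still open
        events = sel.select(timeout)
        if not events:
            break  # deadline passed with nothing readable
        for key, _ in events:
            chunk = key.fileobj.recv(4096)
            if chunk:
                buffers[key.data] += chunk
            else:  # empty read: the peer closed the connection
                sel.unregister(key.fileobj)
                key.fileobj.close()
    sel.close()
    return {name: bytes(buf) for name, buf in buffers.items()}
```

Because every completion is handled in the same thread, the buffers need no locks.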
Link extraction and normalization
 • Goal: obtaining a canonical form of each URL
 • URL processing and filtering
      • avoid multiple fetches of pages known by
        different URLs
      • many IP addresses
          – for load balancing on large sites
              • mirrored contents/contents on the same file system
          – “Proxy pass”
              • mapping of different host names to a single IP address
              • need to publish many logical sites
      • relative URLs
          – need to be interpreted with respect to a base URL
Canonical URL
 • Formed by
      • using a standard string for the protocol
      • canonicalizing the host name
      • adding an explicit port number
      • normalizing and cleaning up the path
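The four canonicalization steps can be sketched as below. Several choices are assumptions rather than anything the slides prescribe: the host name is only lower-cased (a crawler may instead use the canonical name returned by DNS), the default-port table is illustrative, the fragment is dropped, and the path is cleaned with posixpath.normpath.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}  # illustrative table


def canonicalize(url):
    """Return a canonical form of an absolute URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()  # standard string for the protocol
    host = (parts.hostname or "").lower()  # assumption: lower-casing only
    port = parts.port or DEFAULT_PORTS.get(scheme, 80)  # explicit port
    # Resolve "." and ".." segments; an empty path becomes "/".
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"  # normpath strips a trailing slash; put it back
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))


print(canonicalize("HTTP://WWW.Example.COM/a/./b/../c.html#top"))
```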
Robot exclusion
 • Check
      • whether the server prohibits crawling a
        normalized URL
      • in the robots.txt file in the HTTP root directory of
        the server
          – specifies a list of path prefixes which crawlers should
            not attempt to fetch
 • Meant for crawlers only
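Python's standard library ships a robots.txt parser, so the check can be sketched without network access by parsing the file contents directly. The helper name make_robot_filter, the user-agent string and the sample rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def make_robot_filter(robots_txt, user_agent="example-crawler"):
    """Return a predicate telling whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


# Hypothetical robots.txt listing two prohibited path prefixes.
rules = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /private/\n"
allowed = make_robot_filter(rules)
print(allowed("http://www.example.com/index.html"))
print(allowed("http://www.example.com/private/notes.html"))
```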
Eliminating already-visited URLs
 • Checking if a URL has already been fetched
      • before adding a new URL to the work pool
      • needs to be very quick
      • achieved by computing an MD5 hash function on the
        URL
 • Exploiting spatio-temporal locality of access
      • two-level hash function
          – most significant bits (say, 24) derived by hashing the host name
            plus port
          – lower-order bits (say, 40) derived by hashing the path
      • concatenated bits used as a key in a B-tree
 • Qualifying URLs added to the frontier of the crawl
 • Hash values added to the B-tree
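The two-level key can be sketched as follows, using the 24-bit and 40-bit split suggested above. Two simplifications are assumptions of this sketch: each part is taken from the leading bits of an MD5 digest, and a Python set stands in for the B-tree, so it shows membership testing but not the on-disk locality that sorted keys give.

```python
import hashlib
from urllib.parse import urlsplit


def _hash_bits(text, bits):
    """Leading `bits` bits of the MD5 digest of text, as an integer."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def url_key(url, host_bits=24, path_bits=40):
    """64-bit key: host-plus-port hash on top, path hash below."""
    parts = urlsplit(url)
    port = parts.port or 80  # assumption: HTTP default port
    host_part = _hash_bits(f"{parts.hostname}:{port}", host_bits)
    path_part = _hash_bits(parts.path or "/", path_bits)
    return (host_part << path_bits) | path_part


class VisitedSet:
    """Set-based stand-in for the B-tree of visited URL keys."""

    def __init__(self):
        self._keys = set()

    def add_if_new(self, url):
        """Record the URL; return True only the first time it is seen."""
        key = url_key(url)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True
```

URLs on the same server share their top 24 bits, so in a sorted structure such as a B-tree their keys sit next to each other.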
Spider traps
 • Protecting from crashing on
      • ill-formed HTML
          – e.g., a page with 68 kB of null characters
      • misleading sites
          – indefinite number of pages dynamically generated
            by CGI scripts
          – paths of arbitrary depth created using soft
            directory links and path remapping features in
            the HTTP server
Spider Traps: Solutions
 • No automatic technique can be foolproof
 • Check for URL length
 • Guards
      • preparing regular crawl statistics
      • adding dominating sites to the guard module
      • disabling crawling of active content such as CGI
        form queries
      • eliminating URLs with non-textual data types
Avoiding repeated expansion of
         links on duplicate pages
 • Reduce redundancy in crawls
 • Duplicate detection
      • mirrored Web pages and sites
 • Detecting exact duplicates
      • checking against MD5 digests of stored URLs
      • representing a relative link v (relative to aliases u1 and
        u2) as tuples (h(u1); v) and (h(u2); v)
 • Detecting near-duplicates
      • even a single altered character will completely change
        the digest!
          – e.g., date of update, or name and email of the site
            administrator
      • solution: shingling
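The contrast between digests and shingling can be illustrated with a small sketch. The choices here are assumptions: shingles are word 4-grams, and resemblance is measured as the Jaccard overlap of the two shingle sets. Production systems usually compare compact sketches of the shingle sets rather than the full sets.

```python
import hashlib


def digest(text):
    """MD5 digest of a page, as used for exact-duplicate detection."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def shingles(text, k=4):
    """Set of all runs of k consecutive words in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Jaccard overlap of the shingle sets of two texts (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


page1 = "the quick brown fox jumps over the lazy dog last updated on monday"
page2 = "the quick brown fox jumps over the lazy dog last updated on friday"
print(digest(page1) == digest(page2))  # one changed word, different digests
print(resemblance(page1, page2))
```

The two pages differ in a single word, so their digests differ, yet nine of their ten shingles are shared.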
Load monitor
 • Keeps track of various system statistics
      • recent performance of the wide area
        network (WAN) connection
          – e.g., latency and bandwidth estimates
      • operator-provided/estimated upper bound
        on open sockets for a crawler
      • current number of active sockets
Thread manager
 • Responsible for
      • choosing units of work from the frontier
      • scheduling the issue of network resources
      • distribution of these requests over multiple
        ISPs if appropriate
 • Uses statistics from the load monitor
Per-server work queues
 • Servers guard against denial of service (DoS) attacks
      • they limit the speed or frequency of responses to
        any fixed client IP address
 • Avoiding DoS
      • limit the number of active requests to a given
        server IP address at any time
      • maintain a queue of requests for each server
          – use the HTTP/1.1 persistent socket capability
      • distribute attention relatively evenly between
        a large number of sites
 • Access locality vs. politeness dilemma
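One way to spread attention evenly is a queue per server, served in round-robin order. This sketch makes two assumptions: queues are keyed by host name rather than by resolved IP address, and at most one URL per server is handed out per rotation. It does not model persistent connections or timing delays.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class PerServerQueues:
    """Round-robin over per-server FIFO queues of pending URLs."""

    def __init__(self):
        self._queues = OrderedDict()  # host name -> deque of URLs

    def push(self, url):
        host = urlsplit(url).hostname
        self._queues.setdefault(host, deque()).append(url)

    def pop(self):
        """Return the next URL, rotating between servers; None if empty."""
        if not self._queues:
            return None
        host, queue = next(iter(self._queues.items()))
        url = queue.popleft()
        del self._queues[host]
        if queue:
            self._queues[host] = queue  # back of the rotation
        return url


queues = PerServerQueues()
for u in ["http://a.example/1", "http://a.example/2",
          "http://a.example/3", "http://b.example/1"]:
    queues.push(u)
order = [queues.pop() for _ in range(5)]
print(order)
```

Even though a.example has three pending URLs, b.example is served after the first of them.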
Text repository
 • Crawler’s last task
      • dumping fetched pages into a repository
 • Decoupling the crawler from other functions
   preferred, for efficiency and reliability
 • Page-related information stored in two
   parts
      • meta-data
      • page contents
Storage of page-related information
 • Meta-data
      • relational in nature
          – usually managed by custom software to avoid
            relational database system overheads
          – text index involves bulk updates
      • includes fields like content-type, last-modified
        date, content-length, HTTP status code, etc.
Page contents storage
 • Typical HTML Web page compresses to 2-4 kB
   (using zlib)
 • File systems have a 4-8 kB file block size
      • too large!
 • Page storage managed by a custom storage
   manager
      • simple access methods for
          – the crawler to add pages
          – subsequent programs (indexer etc.) to retrieve
            documents
Page Storage
 • Small-scale systems
      • repository fitting within the disks of a single
        machine
      • use of a storage manager (e.g., Berkeley DB)
          – manages disk-based databases within a single file
          – configuration as a hash-table/B-tree for URL
            access key
              • to handle ordered access of pages
          – configuration as a sequential log of page records
              • since the indexer can handle pages in any order
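The sequential-log configuration can be sketched with zlib and a length-prefixed record format. This is not Berkeley DB: the PageLog class, its record layout and the in-memory offset index are assumptions made for illustration, and any seekable binary stream (a file or io.BytesIO) can back it.

```python
import io
import struct
import zlib


class PageLog:
    """Append-only log of zlib-compressed pages with an offset index."""

    def __init__(self, stream):
        self._stream = stream  # seekable binary stream
        self._offsets = {}  # URL -> byte offset of its record

    def add(self, url, html):
        payload = zlib.compress(html.encode("utf-8"))
        self._stream.seek(0, io.SEEK_END)
        self._offsets[url] = self._stream.tell()
        # Record layout: 4-byte big-endian length, then the payload.
        self._stream.write(struct.pack(">I", len(payload)))
        self._stream.write(payload)

    def get(self, url):
        self._stream.seek(self._offsets[url])
        (length,) = struct.unpack(">I", self._stream.read(4))
        return zlib.decompress(self._stream.read(length)).decode("utf-8")


stream = io.BytesIO()
log = PageLog(stream)
page = "<p>hello web</p>" * 200
log.add("http://a.example/", page)
log.add("http://b.example/", "<h1>b</h1>")
print(len(page), len(stream.getvalue()))  # raw size vs. stored size
```

Records are packed back to back, so small compressed pages do not each occupy a whole file-system block.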
Page Storage
 • Large-scale systems
      • repository distributed over a number of
        storage servers
      • storage servers
          – connected to the crawler through a fast local
            network (e.g., Ethernet)
          – hashed by URLs
      • “T3”-grade leased lines
          – to handle 10 million pages (40 GB) per hour
Large-scale crawlers often use multiple ISPs and a bank of local storage
servers to store the pages crawled.
Refreshing crawled pages
 • Search engine's index should be fresh
 • Web-scale crawler never “completes” its job
 • High variance of rate of page changes
 • “If-Modified-Since” request header with
   the HTTP protocol
      • impractical for a crawler
 • Solution
      • at commencement of a new crawling round,
        estimate which pages have changed
Determining page changes
 • “Expires” HTTP response header
      • for pages that come with an expiry date
 • Otherwise need to guess if revisiting that
   page will yield a modified version
      • score reflecting probability of page being
        modified
      • crawler fetches URLs in decreasing order of
        score
      • assumption: recent past predicts the future
Estimating page change rates
 • Brewington and Cybenko; Cho
      • algorithms for maintaining a crawl in which
        most pages are fresher than a specified epoch
 • Prerequisite
      • average interval at which the crawler checks for
        changes is smaller than the inter-modification
        times of a page
 • Small-scale intermediate crawler runs
      • to monitor fast-changing sites
          – e.g., current news, weather, etc.
      • patch intermediate indices into the master
        index
Putting together a crawler
 • Reference implementation of the HTTP client
   protocol
      • World-wide Web Consortium (http://www.w3c.org/)
      • w3c-libwww package
Design of the core components:
             Crawler class
 • To copy bytes from network sockets to storage
   media
 • Three methods to express Crawler's contract
   with user
      • pushing a URL to be fetched to the Crawler
        (fetchPush)
      • termination callback handler (fetchDone) called with
        the same URL
      • method (start) which starts Crawler's event loop
 • Implementation of Crawler class
      • need for two helper classes called DNS and Fetch
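The three-method contract can be illustrated with a Python sketch. Only the method names fetchPush, fetchDone and start come from the slide. Everything else is an assumption: the injected fetch callable stands in for the DNS and Fetch helper classes, the data argument of fetchDone is a guess at what the callback receives, and the loop is sequential rather than event-driven.

```python
from collections import deque


class Crawler:
    """Sketch of the user-facing contract of the Crawler class."""

    def __init__(self, fetch):
        self._fetch = fetch  # assumed callable: URL -> bytes or None
        self._pending = deque()

    def fetchPush(self, url):
        """Queue a URL to be fetched."""
        self._pending.append(url)

    def fetchDone(self, url, data):
        """Termination callback; users override this in a subclass."""
        raise NotImplementedError

    def start(self):
        """Run the loop until no pushed URLs remain."""
        while self._pending:
            url = self._pending.popleft()
            self.fetchDone(url, self._fetch(url))


class CollectingCrawler(Crawler):
    """Example user class that stores results and pushes one more URL."""

    def __init__(self, fetch):
        super().__init__(fetch)
        self.done = []

    def fetchDone(self, url, data):
        self.done.append((url, data))
        if url == "u1":
            self.fetchPush("u2")  # callbacks may push further work


pages = {"u1": b"one", "u2": b"two"}  # toy stand-in for the network
crawler = CollectingCrawler(pages.get)
crawler.fetchPush("u1")
crawler.start()
print(crawler.done)
```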

Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 

Web Crawling Techniques for Search Engines

  • 1. Crawling the Web. Web pages: a few thousand characters long; served over the Internet using the Hypertext Transfer Protocol (HTTP); viewed at the client end using browsers. Crawler: fetches the pages to the computer, where automatic programs can analyze the hypertext documents.
  • 2. HTML (HyperText Markup Language) lets the author specify layout and typeface, embed diagrams, and create hyperlinks. A hyperlink is expressed as an anchor tag with an HREF attribute; HREF names another page using a Uniform Resource Locator (URL). URL = protocol field (“HTTP”) + a server hostname (“www.cse.iitb.ac.in”) + file path (/, the `root' of the published file system). Mining the Web, Chakrabarti and Ramakrishnan
  • 3. HTTP (Hypertext Transfer Protocol). Built on top of the Transmission Control Protocol (TCP). Steps (from the client end): (1) resolve the server host name to an Internet (IP) address using the Domain Name Service (DNS), a distributed database of name-to-IP mappings maintained at a set of known servers; (2) contact the server using TCP, connecting to the default HTTP port (80) on the server; (3) send the HTTP request header (e.g., GET); (4) fetch the response header, which follows MIME (Multipurpose Internet Mail Extensions), a meta-data standard for email and Web content transfer; (5) fetch the HTML page.
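The client-side steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`build_get_request`, `split_response`, `fetch`) are made up for this example, HTTP/1.0 is chosen so the server closes the connection when the page ends, and real crawlers add redirects, error handling, and timeouts per stage.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request header (hypothetical helper)."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}", "Connection: close", "", ""]
    return "\r\n".join(lines).encode("ascii")


def split_response(raw: bytes):
    """Split a raw response into status line, header dict, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_line, headers, body


def fetch(host: str, path: str = "/", port: int = 80, timeout: float = 10.0):
    """Resolve, connect, send the request, and read until the server closes."""
    ip = socket.gethostbyname(host)  # step 1: DNS resolution
    with socket.create_connection((ip, port), timeout=timeout) as sock:  # step 2
        sock.sendall(build_get_request(host, path))  # step 3
        chunks = []
        while True:  # steps 4 and 5: header and page arrive on the same stream
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return split_response(b"".join(chunks))
```

The `Content-Type` entry of the returned header dict carries the MIME type, which a crawler can use to skip non-textual content.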
  • 4. Crawl “all” Web pages? Problem: there is no catalog of all accessible URLs on the Web. Solution: start from a given set of URLs; progressively fetch and scan them for new outlinking URLs; fetch these pages in turn; submit the text in each page to a text indexing system; and so on.
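The fetch-scan-fetch loop described above can be sketched as a breadth-first traversal. This is a minimal illustration, not the authors' crawler: `crawl`, `LinkExtractor`, and the `fetch_page` callback are names assumed for this example, and the "indexing" step is reduced to recording the URL.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the HREF targets of anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch_page, max_pages=100):
    """Fetch pages starting from `seeds`; `fetch_page(url)` returns HTML or None."""
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid refetching
    indexed = []              # stand-in for submission to a text indexer
    while frontier and len(indexed) < max_pages:
        url = frontier.popleft()
        html = fetch_page(url)
        if html is None:
            continue
        indexed.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return indexed
```

The `max_pages` bound is there because, as the next slide notes, a crawl of the live Web may otherwise never halt.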
  • 5. Crawling procedure. Simple in principle, but a great deal of engineering goes into industry-strength crawlers. Industry crawlers crawl a substantial fraction of the Web, e.g., AltaVista, Northern Light, Inktomi. There is no guarantee that all accessible Web pages will be located in this fashion. The crawler may never halt: pages will be added continually even as it is running.
  • 6. Crawling overheads. Delays are involved in: resolving the host name in the URL to an IP address using DNS; connecting a socket to the server and sending the request; receiving the requested page in response. Solution: overlap these delays by fetching many pages at the same time.
  • 7. Anatomy of a crawler. Page fetching threads: each starts with DNS resolution and finishes when the entire page has been fetched. Each page is stored in compressed form to disk/tape and scanned for outlinks. Work pool of outlinks: maintain network utilization without overloading it; dealt with by the load manager. Continue till the crawler has collected a
  • 8. Typical anatomy of a large-scale crawler.
  • 9. Large-scale crawlers: performance and reliability considerations. Need to fetch many pages at the same time: to utilize the network bandwidth, since a single page fetch may involve several seconds of network latency. Highly concurrent and parallelized DNS lookups. Use of asynchronous sockets: explicit encoding of the state of a fetch context in a data structure; polling sockets to check for completion of network transfers; multi-processing or multi-threading: impractical. Care in URL extraction: eliminating duplicates to reduce redundant fetches; avoiding “spider traps”.
  • 10. DNS caching, pre-fetching and resolution. A customized DNS component with: 1. a custom client for address resolution; 2. a caching server; 3. a prefetching client.
  • 11. Custom client for address resolution. Tailored for concurrent handling of multiple outstanding requests. Allows issuing many resolution requests together and polling at a later time for completion of individual requests. Facilitates load distribution among many DNS servers.
  • 12. Caching server. With a large cache, persistent across DNS restarts. Residing largely in memory if possible.
  • 13. Prefetching client. Steps: 1. parse a page that has just been fetched; 2. extract host names from HREF targets; 3. make DNS resolution requests to the caching server. Usually implemented using UDP (User Datagram Protocol), a connectionless, packet-based communication protocol that does not guarantee packet delivery. Does not wait for resolution to be completed.
  • 14. Multiple concurrent fetches. Managing multiple concurrent connections: a single download may take several seconds, so open many socket connections to different HTTP servers simultaneously. Multi-CPU machines are not useful: crawling performance is limited by network and disk. Two approaches: 1. using multi-threading; 2. using non-blocking sockets with event handlers.
  • 15. Multi-threading. Logical threads: physical threads of control provided by the operating system (e.g., pthreads) or concurrent processes. A fixed number of threads is allocated in advance. Programming paradigm: create a client socket; connect the socket to the HTTP service on a server; send the HTTP request header; read the socket (recv) until no more characters are available; close the socket. Uses blocking system calls.
  • 16. Multi-threading: problems. Performance penalty from mutual exclusion: concurrent access to data structures. Slow disk seeks: a great deal of interleaved, random input-output on disk, due to concurrent modification of the document repository by multiple threads.
  • 17. Non-blocking sockets and event handlers. Non-blocking sockets: a connect, send or recv call returns immediately without waiting for the network operation to complete; the status of the network operation is polled separately. The “select” system call lets the application suspend until more data can be read from or written to the socket, timing out after a pre-specified deadline; it can monitor several sockets at the same time. More efficient memory management. Code that completes processing is not interrupted by other completions, so there is no need for locks and semaphores on the pool.
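A small sketch of the select-based style, assuming a hypothetical helper `read_all` that drains several already-connected sockets in a single thread. Each socket's partial data lives in an explicit buffer (the "fetch context"), and no locks are needed because only one completion is processed at a time.

```python
import select
import socket


def read_all(socks, timeout=1.0):
    """Read from several non-blocking sockets until each peer closes.

    Returns a dict mapping each socket to the bytes received on it.
    Stops early if no socket becomes readable within `timeout` seconds.
    """
    buffers = {s: bytearray() for s in socks}  # explicit per-fetch state
    pending = set(socks)
    for s in pending:
        s.setblocking(False)  # recv now returns immediately
    while pending:
        readable, _, _ = select.select(list(pending), [], [], timeout)
        if not readable:
            break  # deadline passed with nothing to read
        for s in readable:
            data = s.recv(4096)
            if data:
                buffers[s].extend(data)
            else:
                pending.discard(s)  # empty read: peer closed, transfer done
    return {s: bytes(buf) for s, buf in buffers.items()}
```

For a quick local check without any network, `socket.socketpair()` gives two connected sockets: write to one end, close it, and pass the other end to `read_all`.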
  • 18. Link extraction and normalization. Goal: obtaining a canonical form of each URL. URL processing and filtering: avoid multiple fetches of pages known by different URLs. One host name may have many IP addresses: for load balancing on large sites; mirrored contents/contents on the same file system; “proxy pass”. Different host names may map to a single IP address: the need to publish many logical sites. Relative URLs need to be interpreted with respect to a base URL.
  • 19. Canonical URL. Formed by: using a standard string for the protocol; canonicalizing the host name; adding an explicit port number; normalizing and cleaning up the path.
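The four steps above map fairly directly onto Python's `urllib.parse`. The sketch below makes simplifying assumptions that are not in the slides: the host name is "canonicalized" only by lowercasing (a production crawler might also use the DNS canonical name), fragments are dropped, and the function name `canonicalize` is illustrative.

```python
import posixpath
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url, base=None):
    """Return a canonical form of `url`, resolving it against `base` if given."""
    if base is not None:
        url = urljoin(base, url)  # interpret relative URLs against the base
    parts = urlsplit(url)
    scheme = parts.scheme.lower()                       # standard protocol string
    host = (parts.hostname or "").lower()               # simplified host canonicalization
    port = parts.port or DEFAULT_PORTS.get(scheme, 80)  # explicit port number
    # Clean up the path: collapse "." and ".." segments, keep a trailing slash.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    if parts.path.endswith("/") and path != "/":
        path += "/"
    return urlunsplit((scheme, f"{host}:{port}", path, parts.query, ""))
```

Two differently written URLs for the same page then compare equal as strings, which is what the visited-URL check relies on.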
  • 20. Robot exclusion. Check whether the server prohibits crawling a normalized URL: the robots.txt file in the HTTP root directory of the server specifies a list of path prefixes which crawlers should not attempt to fetch. Meant for crawlers only.
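Python's standard library ships a robots.txt parser, so the check can be sketched without any custom parsing. The wrapper name `make_robot_checker` and the user-agent string `ExampleCrawler` are assumptions for this example; the robots.txt text is passed in directly so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser


def make_robot_checker(robots_txt, user_agent="ExampleCrawler"):
    """Return a function that tells whether `user_agent` may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse the file contents offline
    return lambda url: parser.can_fetch(user_agent, url)


# Usage: the disallowed entries are path prefixes, as the slide describes.
robots = "User-agent: *\nDisallow: /cgi-bin/\nDisallow: /private\n"
allowed = make_robot_checker(robots)
print(allowed("http://example.org/index.html"))
print(allowed("http://example.org/cgi-bin/search?q=x"))
```

In a live crawler the file would be fetched once per server and cached alongside the per-server work queue.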
  • 21. Eliminating already-visited URLs. Checking whether a URL has already been fetched, before adding a new URL to the work pool, needs to be very quick; this is achieved by computing an MD5 hash function on the URL. Exploiting spatio-temporal locality of access with a two-level hash function: the most significant bits (say, 24) are derived by hashing the host name plus port; the lower-order bits (say, 40) are derived by hashing the path. The concatenated bits are used as a key in a B-tree. Qualifying URLs are added to the frontier of the crawl, and their hash values are added to the B-tree.
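The two-level key can be sketched as follows, using the 24-bit and 40-bit widths suggested on the slide. Two assumptions are worth flagging: the bits are taken from the top of each MD5 digest (the slide does not say which bits to use), and an in-memory Python `set` stands in for the B-tree. The names `url_key` and `VisitedSet` are illustrative.

```python
import hashlib
from urllib.parse import urlsplit

HOST_BITS = 24  # high-order bits: hash of host name plus port
PATH_BITS = 40  # low-order bits: hash of the path


def _md5_bits(text, bits):
    """Return the top `bits` bits of the MD5 digest of `text` as an int."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def url_key(url):
    """Build a 64-bit key whose prefix is shared by all URLs on one server."""
    parts = urlsplit(url)
    port = parts.port or 80
    host_part = _md5_bits(f"{parts.hostname}:{port}", HOST_BITS)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    path_part = _md5_bits(path, PATH_BITS)
    return (host_part << PATH_BITS) | path_part


class VisitedSet:
    """Stand-in for the B-tree of visited-URL keys (assumption: fits in memory)."""

    def __init__(self):
        self._keys = set()

    def add_if_new(self, url):
        """Record `url`; return True if it had not been seen before."""
        key = url_key(url)
        if key in self._keys:
            return False
        self._keys.add(key)
        return True
```

Because URLs on the same host share their high-order bits, their keys sort next to each other, so in an ordered index such as a B-tree the lookups for links within one site tend to touch the same few pages.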
  • 22. Spider traps. Protecting the crawler from crashing on: ill-formed HTML, e.g., a page with 68 kB of null characters; misleading sites, with an indefinite number of pages dynamically generated by CGI scripts, or paths of arbitrary depth created using soft directory links and path remapping features in the HTTP server.
  • 23. Spider traps: solutions. No automatic technique can be foolproof. Check for URL length. Guards: preparing regular crawl statistics; adding dominating sites to the guard module; disabling crawling of active content such as CGI form queries; eliminating URLs with non-textual data types.
  • 24. Avoiding repeated expansion of links on duplicate pages. Reduce redundancy in crawls. Duplicate detection: mirrored Web pages and sites. Detecting exact duplicates: checking against MD5 digests of stored URLs; representing a relative link v (relative to aliases u1 and u2) as tuples (h(u1); v) and (h(u2); v). Detecting near-duplicates: even a single altered character will completely change the digest, e.g., the date of update or the name and email of the site administrator. Solution: shingling.
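Shingling is only named on the slide, so the following is a generic sketch of the idea rather than the authors' algorithm: each document is reduced to its set of overlapping k-word sequences, and two documents are compared by the Jaccard overlap of those sets. The shingle width `k=4` and the function names are assumptions.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (as tuples) of `text`."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Jaccard overlap of the shingle sets of two documents, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


# Usage: a one-word edit changes any digest of the text,
# but leaves most shingles intact.
page_v1 = "the quick brown fox jumps over the lazy dog near the river bank today"
page_v2 = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
print(resemblance(page_v1, page_v2))
```

At Web scale the full shingle sets are too large to store, so practical systems keep only a small sample or sketch of each set; that refinement is outside this example.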
  • 25. Load monitor. Keeps track of various system statistics: recent performance of the wide area network (WAN) connection, e.g., latency and bandwidth estimates; operator-provided/estimated upper bound on open sockets for a crawler; current number of active sockets.
  • 26. Thread manager. Responsible for: choosing units of work from the frontier; scheduling the issue of network resources; distributing these requests over multiple ISPs if appropriate. Uses statistics from the load monitor.
  • 27. Per-server work queues. Servers guard against denial of service (DoS) attacks by limiting the speed or frequency of responses to any fixed client IP address. To avoid looking like a DoS attack, the crawler should: limit the number of active requests to a given server IP address at any time; maintain a queue of requests for each server; use the HTTP/1.1 persistent socket capability; distribute attention relatively evenly between a large number of sites. Access locality vs. politeness dilemma.
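One way to picture per-server queues with a cap on active requests is the small scheduler below. It is a sketch under assumptions: queues are keyed by host name rather than by resolved server IP address, the class name `PoliteFrontier` and its methods are invented for this example, and servers are served round-robin to spread attention evenly.

```python
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Per-host FIFO queues with a limit on in-flight requests per host."""

    def __init__(self, max_active_per_host=1):
        self.max_active = max_active_per_host
        self.queues = {}  # host -> deque of pending URLs (dict keeps insertion order)
        self.active = {}  # host -> number of requests currently in flight

    def push(self, url):
        host = urlsplit(url).hostname
        self.queues.setdefault(host, deque()).append(url)

    def pop(self):
        """Return the next URL from a host below its limit, or None."""
        for host in list(self.queues):
            if self.active.get(host, 0) < self.max_active:
                queue = self.queues.pop(host)
                url = queue.popleft()
                self.active[host] = self.active.get(host, 0) + 1
                if queue:
                    self.queues[host] = queue  # re-insert at the back: round-robin
                return url
        return None

    def done(self, url):
        """Mark a fetch as finished so the host can be scheduled again."""
        host = urlsplit(url).hostname
        self.active[host] -= 1
```

With the default limit of one, a second URL for a busy host stays queued until `done` is called for the first, while URLs for other hosts keep flowing.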
  • 28. Text repository. The crawler's last task: dumping fetched pages into a repository. Decoupling the crawler from other functions is preferred for efficiency and reliability. Page-related information is stored in two parts: meta-data and page contents.
  • 29. Storage of page-related information. Meta-data: relational in nature; usually managed by custom software to avoid relational database system overheads; the text index involves bulk updates; includes fields like content-type, last-modified date, content-length, HTTP status code, etc.
  • 30. Page contents storage. A typical HTML Web page compresses to 2-4 kB (using zlib). File systems have a 4-8 kB file block size: too large. Page storage is therefore managed by a custom storage manager with simple access methods for the crawler to add pages and for subsequent programs (indexer etc.) to retrieve documents.
  • 31. Page storage: small-scale systems. Repository fitting within the disks of a single machine. Use of a storage manager (e.g., Berkeley DB), which manages disk-based databases within a single file. Configuration as a hash-table/B-tree with the URL as access key: to handle ordered access of pages. Configuration as a sequential log of page records: since the indexer can handle pages in any order.
  • 32. Page storage: large-scale systems. Repository distributed over a number of storage servers. Storage servers: connected to the crawler through a fast local network (e.g., Ethernet); hashed by URLs. `T3' grade leased lines: to handle 10 million pages (40 GB) per hour.
  • 33. Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.
  • 34. Refreshing crawled pages. A search engine's index should be fresh. A Web-scale crawler never `completes' its job. High variance in the rate of page changes. The “If-Modified-Since” HTTP request header: impractical for a crawler. Solution: at the commencement of a new crawling round, estimate which pages have changed.
  • 35. Determining page changes. “Expires” HTTP response header: for pages that come with an expiry date. Otherwise, need to guess whether revisiting a page will yield a modified version: compute a score reflecting the probability of the page having been modified; the crawler fetches URLs in decreasing order of score. Assumption: the recent past predicts the future.
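The slide does not say how the score is computed, so the sketch below uses a deliberately naive stand-in: the fraction of past visits on which the page was found modified. Both function names and the treatment of pages with no history (scored 1.0, i.e. visited first) are assumptions made for illustration.

```python
def change_score(history):
    """Estimate the probability of change from past visits.

    `history` is a list of booleans, one per past visit, True when the
    page had been modified since the visit before.
    """
    if not history:
        return 1.0  # assumption: pages never revisited are treated as likely changed
    return sum(history) / len(history)


def refresh_order(histories):
    """Return URLs sorted by decreasing change score."""
    return sorted(histories, key=lambda url: change_score(histories[url]), reverse=True)


# Usage: a news page that changed on 3 of 4 visits is fetched before
# a blog (1 of 4) and a static "about" page (0 of 4).
histories = {
    "http://example.org/about": [False, False, False, False],
    "http://example.org/news": [True, True, True, False],
    "http://example.org/blog": [True, False, False, False],
}
print(refresh_order(histories))
```

This embodies the slide's stated assumption that the recent past predicts the future; the change-rate estimators cited on the next slide are more principled alternatives.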
  • 36. Estimating page change rates. Brewington and Cybenko; Cho: algorithms for maintaining a crawl in which most pages are fresher than a specified epoch. Prerequisite: the average interval at which the crawler checks for changes is smaller than the inter-modification times of a page. Small-scale intermediate crawler runs: to monitor fast-changing sites, e.g., current news, weather, etc. Intermediate indices are patched into the master index.
  • 37. Putting together a crawler. Reference implementation of the HTTP client protocol: World-wide Web Consortium (http://www.w3c.org/), w3c-libwww package.
  • 38. Design of the core components: Crawler class. Purpose: to copy bytes from network sockets to storage media. Three methods express the Crawler's contract with the user: pushing a URL to be fetched to the Crawler (fetchPush); a termination callback handler (fetchDone) called with the same URL; a method (start) which starts the Crawler's event loop. The implementation of the Crawler class needs two helper classes called DNS and Fetch.