Web Robots


 ISHAN MISHRA
www.IshanTech.org



                    1
Outline
   Robot applications
   How it works
   Cycle Avoidance




                         2
Applications
   Behavior of web robots
       Wander from web site to site (recursively),
       1. Fetching content,
       2. Following hyperlinks,
        3. Processing the data they find.

   Colorful names
       Crawlers,
       Spiders,
       Worms,
       Bots


                                                      3
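The fetch/follow/process cycle above can be sketched as a small breadth-first crawl loop. This is a minimal sketch, not any particular robot's implementation; `fetch` and `extract_links` are hypothetical callables standing in for an HTTP client and an HTML link extractor, injected so the loop itself stays library-independent:

```python
from collections import deque

def crawl(root_set, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: fetch content, follow hyperlinks, process pages."""
    frontier = deque(root_set)   # URLs waiting to be fetched
    visited = set(root_set)      # cycle avoidance: never enqueue a URL twice
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        content = fetch(url)                 # 1. fetch content
        pages[url] = content                 # 3. process the data (here: store it)
        for link in extract_links(content):  # 2. follow hyperlinks
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return pages
```

The `visited` set is what keeps the robot out of the cycles discussed in the next slides.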
Where to Start: The “Root Set”


[Figure: example web graph. The root set {A, G, L, S} is chosen so that
following its links eventually reaches the remaining pages (B, C, D, E, F,
H, I, J, K, M, N, O, P, Q, R, T, U).]

                                                        4
Cycle Avoidance


[Figure: three snapshots of a small web graph containing pages A, B, C, and D,
where A, B, and C link in a circle.]

(a) Robot fetches page A, follows a link, and fetches B.
(b) Robot follows a link and fetches page C.
(c) Robot follows a link and is back at A, trapped in the cycle A-B-C.
                                                                                    5
Loops
   Cycles are bad for crawlers, for three
    reasons:
        They waste the robot’s time and space.
        They can overwhelm the web site.
        They produce duplicate content.




                                            6
Data structures for robots
   Trees and hash tables
   Lossy presence bit maps
   Checkpoints
       Save the list of visited URLs to disk, in case the
        robot crashes
   Partitioning
       Robot farms


                                                            7
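The checkpoint idea can be sketched in a few lines: persist the visited-URL set to disk so a crashed robot can resume where it left off. The one-URL-per-line file format here is an illustrative choice, not a standard:

```python
def save_checkpoint(visited, path):
    # Persist the visited-URL set, one URL per line.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(visited)))

def load_checkpoint(path):
    # Restore the set after a crash; an absent file means a fresh start.
    try:
        with open(path, encoding="utf-8") as f:
            return set(line for line in f.read().splitlines() if line)
    except FileNotFoundError:
        return set()
```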
Canonicalizing URLs
       Most web robots try to eliminate the
        obvious aliases by “canonicalizing” URLs
        into a standard form, by:
         adding “:80” to the hostname, if the port
          isn’t specified.
         converting all %xx escaped characters into
          their character equivalents.
         removing the # fragment.


                                                       8
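The three rules above can be sketched with the standard `urllib.parse` module. This is a simplified sketch that assumes http URLs (it always defaults the port to 80) and decodes every %xx escape, which real robots refine further:

```python
from urllib.parse import urlsplit, urlunsplit, unquote

def canonicalize(url):
    """Normalize a URL per the three rules above (simplified, http-only sketch)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    port = parts.port or 80                 # rule 1: make the port explicit
    netloc = f"{host}:{port}"               # hostname is already lowercased
    path = unquote(parts.path)              # rule 2: decode %xx escapes
    return urlunsplit((parts.scheme, netloc, path or "/",
                       parts.query, ""))    # rule 3: drop the #fragment
```

For example, `canonicalize("HTTP://www.Example.com/%7Efred#top")` yields `http://www.example.com:80/~fred`.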
Symbolic link cycles

[Figure: two directory trees rooted at /, each with index.html and subdir.]

(a) subdir is a real directory, containing its own index.html and logo.gif.
(b) subdir is an upward symbolic link back to /, so URLs like
    /subdir/subdir/index.html recurse forever.
                                                                      9
Dynamic Virtual Web Spaces
   It is possible to publish a URL that looks like a normal
    file but really is a gateway application.
   This application can generate HTML on the fly that
    contains links to imaginary URLs on the same server.
    When these imaginary URLs are requested, new imaginary
    URLs are generated.

   Such a malicious web server can take a poor robot on
    an Alice-in-Wonderland journey through an infinite virtual
    space, even if the web server doesn’t really contain any
    files. This trap can be hard for the robot to detect,
    because the HTML and URLs may look different every
    time.

   For example, a CGI-based calendaring program.
                                                                 10
Malicious dynamic web space
example




                              11
Techniques for avoiding loops
   Canonicalizing URLs
   Breadth-first crawling
   Throttling
       Limit the number of pages the robot can fetch from a
        web site in a period of time.
   Limit URL size
       Avoid symbolic cycle problem.
       Problem: many sites use URLs to maintain user state.
   URL/site blacklist
       vs. “excluding Robot”

                                                               12
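Throttling can be sketched as a per-host minimum delay between requests. This is an illustrative design, not a standard API; the wait time is computed from a caller-supplied clock so the policy is easy to test without sleeping:

```python
import time

class Throttle:
    """Per-site throttle: enforce a minimum delay between requests to a host."""
    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last = {}  # host -> time of the last request to that host

    def wait_time(self, host, now=None):
        # Seconds the robot should still wait before hitting this host again.
        now = time.monotonic() if now is None else now
        elapsed = now - self.last.get(host, float("-inf"))
        return max(0.0, self.delay - elapsed)

    def record(self, host, now=None):
        # Call after each request so the next wait_time is measured from it.
        self.last[host] = time.monotonic() if now is None else now
```

A crawl loop would call `time.sleep(t.wait_time(host))` before each fetch and `t.record(host)` after it.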
Techniques for avoiding loops
   Pattern detection
       e.g., “subdir/subdir/subdir…”
       e.g., “subdir/images/subdir/images/subdir/…”

   Content fingerprinting
        A checksum over the page content; the odds of two different
         pages having the same checksum are small.
       Message digest functions such as MD5 are popular for this
        purpose.

   Human monitoring
        Design your robot with diagnostics and logging, so
         human beings can easily monitor the robot’s progress and be
         warned quickly if something unusual is happening.
                                                                     13
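Content fingerprinting, as described above, can be sketched with the standard `hashlib` module: hash each page body with MD5 and skip bodies whose digest has been seen before. The in-memory `seen` set is an illustrative simplification (a large robot would use a disk-backed structure):

```python
import hashlib

def fingerprint(content: bytes) -> str:
    # MD5 digest of the page body; two different pages rarely collide.
    return hashlib.md5(content).hexdigest()

seen = set()

def is_duplicate(content: bytes) -> bool:
    # True if an identical body was already fetched (a content cycle).
    fp = fingerprint(content)
    if fp in seen:
        return True
    seen.add(fp)
    return False
```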
Robotic HTTP
   No different from any other HTTP client program.
   Many robots try to implement the minimum
    amount of HTTP needed to request the content
    they seek.

   It is recommended that robot implementers
    send some basic header information to notify
    the site of the capabilities of the robot, the
    robot’s identity, and where it originated.

                                                           14
Identifying Request Header
   User-Agent
       Tell the server the robot’s name
   From
        Give the email address of the robot’s user/admin.
   Accept
       Tell the server what media types are okay to send.
        (e.g. only fetch text and sound).
   Referer
       Tell the server how a robot found links to this site’s
        content.


                                                                 15
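Sending these identifying headers can be sketched with `urllib.request`. The robot name, contact address, and URLs here are placeholders, not real identities:

```python
from urllib.request import Request

def identified_request(url, referer=None):
    """Build a request carrying the identifying headers above."""
    headers = {
        "User-Agent": "ExampleBot/1.0",        # the robot's name
        "From": "robot-admin@example.com",     # who runs the robot
        "Accept": "text/html, text/plain",     # media types we can handle
    }
    if referer:
        headers["Referer"] = referer           # where we found the link
    return Request(url, headers=headers)
```

The resulting `Request` would then be passed to `urllib.request.urlopen` to perform the fetch.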
Virtual docroots cause trouble if
 no Host header is sent


Robot tries to request index.html from www.csie.ncnu.edu.tw, but does not
include a Host header. The server is configured to serve both sites, but
serves www.ncnu.edu.tw by default.

Web robot client request message:
GET /index.html HTTP/1.0
User-agent: ShopBot 1.0

Response message from the server (www.ncnu.edu.tw / www.csie.ncnu.edu.tw):
HTTP/1.0 200 OK
[…]
<HTML>
<TITLE>National Chi Nan University</TITLE>
[…]
                                                                                 16
What else a robot should support
   Support Virtual Hosting
        Not including this can lead to robots identifying the wrong content with
         a particular URL.

   Conditional Requests
        To minimize the amount of content retrieved, by conditional HTTP
         requests. (like cache revalidation)

   Response Handling
        Status code: 200 OK, 404 Not Found, 304
        Entities: <meta http-equiv=“refresh” content=“1; URL=index.html”>

   User-Agent Targeting
        Webmasters should keep in mind that many robots will visit their
         site. Many sites optimize content for various user agents (IE or
         Netscape).
        Problem: “your browser does not support frames.”


                                                                                    17
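The conditional-request idea above can be sketched as two small helpers: one builds revalidation headers from whatever validators were cached, and one interprets the status codes listed under Response Handling. The `cache_entry` dict shape (`etag`, `last_modified` keys) is an assumption of this sketch:

```python
def conditional_headers(cache_entry):
    """Headers for a conditional re-fetch (cache-revalidation style)."""
    headers = {}
    if cache_entry.get("etag"):
        headers["If-None-Match"] = cache_entry["etag"]
    if cache_entry.get("last_modified"):
        headers["If-Modified-Since"] = cache_entry["last_modified"]
    return headers

def handle_status(status, cached_body, new_body):
    if status == 304:       # Not Modified: reuse the stored copy
        return cached_body
    if status == 200:       # OK: take the fresh content
        return new_body
    return None             # 404 and friends: drop the URL
```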
Misbehaving Robots
   Runaway robot
       Robots issue HTTP requests as fast as they can.
   Stale URLs
       Robots visit stale lists of URLs, requesting pages that no longer exist.
   Long, wrong URLs
       May reduce a web server’s performance, clutter the server’s access
        logs, and even crash the server.
   Nosy robots
       Some robots may get URLs that point to private data and make
        that data easily accessible through search engines.
   Dynamic gateway access
       Robots don’t always know what they are accessing.


                                                                       18
Excluding Robots


                                          www.ncnu.edu.tw


Robot parses the robots.txt file and
determines if it is allowed to access
the acetylene-torches.html file.

It is, so it proceeds with the request.




                                                            19
robots.txt format
   # allow googlebot and csiebot to crawl the public parts
    of our site, but no other robots are allowed to
    crawl anything on our site
   User-Agent: googlebot
   User-Agent: csiebot
   Disallow: /private

   User-Agent: *
   Disallow: /
                                                       20
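A robot can evaluate rules like these with Python's standard `urllib.robotparser`. Note that the catch-all record in this sketch uses `Disallow: /`, which is what actually bars other robots from everything (an empty `Disallow:` would allow them everything); the host www.example.com is a placeholder:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: googlebot
User-Agent: csiebot
Disallow: /private

User-Agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

ok_google = rp.can_fetch("googlebot", "http://www.example.com/index.html")  # allowed
ok_other = rp.can_fetch("otherbot", "http://www.example.com/index.html")    # barred
```

In practice the robot would first fetch `/robots.txt` from the site (e.g. via `rp.set_url(...)` and `rp.read()`) instead of parsing an inline string.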
Robots Exclusion Standard
        versions

Version Title and description                               Date
0.0     A Standard for Robot Exclusion: Martijn Koster’s    June 1994
        original robots.txt mechanism with the Disallow
        directive

1.0     A Method for Web Robots Control: Martijn            Nov. 1996
        Koster’s IETF draft with additional support for
        Allow

2.0     An Extended Standard for Robot Exclusion: Sean      Nov. 1996
        Conner’s extension including regex and timing
        information; not widely supported




                                                                       21
Robots.txt path matching
        examples
Rule path          URL path            Match?  Comments
/tmp               /tmp                ✓       Rule path == URL path
/tmp               /tmpfile.html       ✓       Rule path is a prefix of URL path
/tmp               /tmp/a.html         ✓       Rule path is a prefix of URL path
/tmp/              /tmp                ✗       /tmp/ is not a prefix of /tmp
(empty)            README.TXT          ✓       Empty rule path matches everything
/~fred/hi.html     /%7Efred/hi.html    ✓       %7E is treated the same as ~
/%7Efred/hi.html   /~fred/hi.html      ✓       %7E is treated the same as ~
/%7efred/hi.html   /%7Efred/hi.html    ✓       Case isn’t significant in escapes
/~fred/hi.html     /~fred%2Fhi.html    ✗       %2F is slash, but slash is a special
                                               case that must match exactly
                                                                                  22
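The matching rules in this table can be sketched in a few lines: unescape %xx sequences case-insensitively, except %2F, which must keep matching only a literal escaped slash, then do a prefix comparison. This is an illustrative sketch of the rules above, not a full robots.txt matcher:

```python
import re
from urllib.parse import unquote

def normalize(path):
    # Decode %xx escapes (case-insensitively), but leave %2F escaped:
    # an escaped slash is a special case that must match exactly.
    return "%2F".join(unquote(piece) for piece in re.split("%2[fF]", path))

def rule_matches(rule_path, url_path):
    # Empty rule path matches everything; otherwise prefix match.
    return normalize(url_path).startswith(normalize(rule_path))
```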
HTML Robot-control Meta Tags
   e.g.
        <META NAME=“ROBOTS” CONTENT=directive-list>

   Directive-list
        NOINDEX
             Not to process this document content
        NOFOLLOW
             Not to crawl any outgoing links from this page

        INDEX
        FOLLOW
        NOARCHIVE
             Should not cache a local copy of the page
        ALL (equivalent to INDEX, FOLLOW)
        NONE (equivalent to NOINDEX, NOFOLLOW)


                                                               23
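A robot can collect these directives with the standard `html.parser` module. A minimal sketch (the class name `RobotMeta` and the sample HTML are illustrative):

```python
from html.parser import HTMLParser

class RobotMeta(HTMLParser):
    """Collect directives from <META NAME="ROBOTS" CONTENT="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)  # html.parser lowercases tag and attribute names
        if tag == "meta" and attr.get("name", "").lower() == "robots":
            for d in attr.get("content", "").split(","):
                self.directives.add(d.strip().upper())

parser = RobotMeta()
parser.feed('<html><head><meta name="robots" content="noindex, nofollow"></head></html>')
# parser.directives is now {"NOINDEX", "NOFOLLOW"}
```

A crawler would then check for NOINDEX before indexing the page and NOFOLLOW before enqueuing its outgoing links.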
Additional META tag directives

name=            content=     Description
DESCRIPTION      <text>       Allows an author to define a short text summary of the
                              web page. Many search engines look at META DESCRIPTION
                              tags, allowing page authors to specify appropriate short
                              abstracts to describe their web pages.
                              <meta name=“description”
                                  content=“Welcome to Mary’s Antiques web site”>
KEYWORDS         <comma       Associates a comma-separated list of words that
                 list>        describes the web page, to assist in keyword searches.
                              <meta name=“keywords”
                                 content=“antiques,mary,furniture,restoration”>
REVISIT-AFTER*   <no. days>   Instructs the robot or search engine that the page
                              should be revisited, presumably because it is subject
                              to change, after the specified number of days.
                              <meta name=“revisit-after” content=“10 days”>

* This directive is not likely to have wide support.                              24
Guidelines for web robot
operators (Robot Etiquette)




                              25
Guidelines for web robot
operators (cont.)




                           26
Guidelines for web robot
operators (cont.)




                           27
Guidelines for web robot
operators (cont.)




                           28
Guidelines for web robot
operators (cont.)




                           29
Modern Search Engine
             Architecture


[Figure: many users issue queries to a web search gateway, which answers
them from a full-text index database; a search engine crawler/indexer
populates that database by crawling web servers. The left half (users and
gateway) is the query engine; the right half is crawling and indexing.]
                                                                         30
Full-Text Index




                  31
Posting the Query
User fills out an HTML search form (with a GET-action HTTP method) in the
browser and hits Submit.

Client request message:
GET /search.html?query=drills HTTP/1.1
Host: www.csie.ncnu.edu.tw
Accept: *
User-agent: ShopBot

The web server (www.csie.ncnu.edu.tw) passes the query “drills” to the
search gateway, which returns results (e.g., file “BD.html”).

Response message:
HTTP/1.1 200 OK
Content-type: text/html
Content-length: 1037

<HTML>
<HEAD><TITLE>Search Results</TITLE>
[…]
                                                                            32
Reference (HW#4)
 Paper reading: “Searching the Web.”
 Paper reading: “Hyperlink analysis for the Web,” IEEE Internet Computing, 2001.
http://www.searchtools.com
  Search Tools for Web Sites and Intranets: resources for search tools and
  robots.
http://www.robotstxt.org/wc/robots.html
  The Web Robots Pages: resources for robot developers, including the
  registry of Internet Robots.
http://www.searchengineworld.com
  Search Engine World: resources for search engines and robots.
http://search.cpan.org/dist/libwww-perl/lib/WWW/RobotRules.pm
  RobotRules Perl source.
http://www.conman.org/people/spc/robots2.html
  An Extended Standard for Robot Exclusion.
Witten, I., Moffat, A., and Bell, T., Managing Gigabytes: Compressing and
  Indexing Documents and Images, Morgan Kaufmann.                            33

Mais conteúdo relacionado

Destaque

Destaque (10)

R&amp;b history
R&amp;b historyR&amp;b history
R&amp;b history
 
How to create favicon
How to   create    faviconHow to   create    favicon
How to create favicon
 
how to create a blog on wordpress
how to create  a blog  on  wordpress how to create  a blog  on  wordpress
how to create a blog on wordpress
 
How to create rss feed for your website
How to create  rss feed  for  your  websiteHow to create  rss feed  for  your  website
How to create rss feed for your website
 
Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...
Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...
Chapple, R. M. 2014 A Game of Murals. Westeros & Changing Times in East Belfa...
 
How to create rss feed
How to create rss feedHow to create rss feed
How to create rss feed
 
How to track website visitors using Google analytics
How to track website visitors using Google analyticsHow to track website visitors using Google analytics
How to track website visitors using Google analytics
 
how to setup Google analytics tracking code for website
how to setup  Google analytics tracking code for websitehow to setup  Google analytics tracking code for website
how to setup Google analytics tracking code for website
 
How to create sitemap for website
How to create sitemap for websiteHow to create sitemap for website
How to create sitemap for website
 
Evareporte
EvareporteEvareporte
Evareporte
 

Semelhante a Introduction to "robots.txt

Drupal is not your Website
Drupal is not your Website Drupal is not your Website
Drupal is not your Website
Phase2
 

Semelhante a Introduction to "robots.txt (20)

Web Development Presentation
Web Development PresentationWeb Development Presentation
Web Development Presentation
 
HTML5 Real-Time and Connectivity
HTML5 Real-Time and ConnectivityHTML5 Real-Time and Connectivity
HTML5 Real-Time and Connectivity
 
WEB I - 01 - Introduction to Web Development
WEB I - 01 - Introduction to Web DevelopmentWEB I - 01 - Introduction to Web Development
WEB I - 01 - Introduction to Web Development
 
Top 10 HTML5 Features for Oracle Cloud Developers
Top 10 HTML5 Features for Oracle Cloud DevelopersTop 10 HTML5 Features for Oracle Cloud Developers
Top 10 HTML5 Features for Oracle Cloud Developers
 
Of CORS thats a thing how CORS in the cloud still kills security
Of CORS thats a thing how CORS in the cloud still kills securityOf CORS thats a thing how CORS in the cloud still kills security
Of CORS thats a thing how CORS in the cloud still kills security
 
Technical SEO | Joomla Day Chicago 2012
Technical SEO | Joomla Day Chicago 2012 Technical SEO | Joomla Day Chicago 2012
Technical SEO | Joomla Day Chicago 2012
 
Publishing strategies for API documentation
Publishing strategies for API documentationPublishing strategies for API documentation
Publishing strategies for API documentation
 
WebCrawler
WebCrawlerWebCrawler
WebCrawler
 
Browser Internals-Same Origin Policy
Browser Internals-Same Origin PolicyBrowser Internals-Same Origin Policy
Browser Internals-Same Origin Policy
 
Web Scraping
Web ScrapingWeb Scraping
Web Scraping
 
Webbasics
WebbasicsWebbasics
Webbasics
 
improve website performance
improve website performanceimprove website performance
improve website performance
 
Web development using ASP.NET MVC
Web development using ASP.NET MVC Web development using ASP.NET MVC
Web development using ASP.NET MVC
 
Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1Code for Startup MVP (Ruby on Rails) Session 1
Code for Startup MVP (Ruby on Rails) Session 1
 
Drupal is not your Website
Drupal is not your Website Drupal is not your Website
Drupal is not your Website
 
Search Engine Spiders
Search Engine SpidersSearch Engine Spiders
Search Engine Spiders
 
Rendering: Or why your perfectly optimized content doesn't rank
Rendering: Or why your perfectly optimized content doesn't rankRendering: Or why your perfectly optimized content doesn't rank
Rendering: Or why your perfectly optimized content doesn't rank
 
Kotlin server side frameworks
Kotlin server side frameworksKotlin server side frameworks
Kotlin server side frameworks
 
From ZERO to REST in an hour
From ZERO to REST in an hour From ZERO to REST in an hour
From ZERO to REST in an hour
 
Unit 02: Web Technologies (1/2)
Unit 02: Web Technologies (1/2)Unit 02: Web Technologies (1/2)
Unit 02: Web Technologies (1/2)
 

Mais de Ishan Mishra

Mais de Ishan Mishra (16)

Political Strategist India | Significance of social media in political campaign
Political Strategist India | Significance of social media in political campaignPolitical Strategist India | Significance of social media in political campaign
Political Strategist India | Significance of social media in political campaign
 
Social Media Agency & Digital Marketing Company in Indore
Social Media Agency & Digital Marketing Company in IndoreSocial Media Agency & Digital Marketing Company in Indore
Social Media Agency & Digital Marketing Company in Indore
 
Best Off-page-SEO Techniques for 2020
Best Off-page-SEO Techniques for 2020Best Off-page-SEO Techniques for 2020
Best Off-page-SEO Techniques for 2020
 
SEO Services Indore, SEO Indore, SEO Company Indore
SEO Services Indore, SEO Indore, SEO Company IndoreSEO Services Indore, SEO Indore, SEO Company Indore
SEO Services Indore, SEO Indore, SEO Company Indore
 
ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...
ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...
ISHANTECH - AN INTERACTIVE MARKETING AGENCY SPECIALIZING IN SEO, PPC, CRO, CV...
 
Top 15 personal finance tips in 2015
Top 15 personal finance tips in 2015Top 15 personal finance tips in 2015
Top 15 personal finance tips in 2015
 
Buy vs rent 2015 in India | Real Estate Guide 2015 India
Buy vs rent 2015 in India | Real Estate Guide 2015 India Buy vs rent 2015 in India | Real Estate Guide 2015 India
Buy vs rent 2015 in India | Real Estate Guide 2015 India
 
AdSense Optimization Tips for increased ad Revenue
AdSense Optimization Tips for increased ad RevenueAdSense Optimization Tips for increased ad Revenue
AdSense Optimization Tips for increased ad Revenue
 
Online Travel Agency Report on Social Media Habits of Trave
Online Travel Agency Report on Social Media Habits of TraveOnline Travel Agency Report on Social Media Habits of Trave
Online Travel Agency Report on Social Media Habits of Trave
 
Management lesson from Mahabharat
Management lesson from MahabharatManagement lesson from Mahabharat
Management lesson from Mahabharat
 
Atif Aslam's Biography
Atif Aslam's BiographyAtif Aslam's Biography
Atif Aslam's Biography
 
Inbound Marketing Agency India | ISHAN-Tech
Inbound Marketing Agency India  | ISHAN-TechInbound Marketing Agency India  | ISHAN-Tech
Inbound Marketing Agency India | ISHAN-Tech
 
Crystal IT Park Indore IT ccompanies
Crystal IT Park Indore IT ccompaniesCrystal IT Park Indore IT ccompanies
Crystal IT Park Indore IT ccompanies
 
Global Management Consulting, Technology and Outsourcing Services from ISHAN...
 Global Management Consulting, Technology and Outsourcing Services from ISHAN... Global Management Consulting, Technology and Outsourcing Services from ISHAN...
Global Management Consulting, Technology and Outsourcing Services from ISHAN...
 
ISHAN-TECH Consulting
ISHAN-TECH ConsultingISHAN-TECH Consulting
ISHAN-TECH Consulting
 
Online Marketing Company, Social Media Marketing, Digital Marketing, Indore, ...
Online Marketing Company, Social Media Marketing, Digital Marketing, Indore, ...Online Marketing Company, Social Media Marketing, Digital Marketing, Indore, ...
Online Marketing Company, Social Media Marketing, Digital Marketing, Indore, ...
 

Último

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 

Introduction to "robots.txt

  • 1. Web Robots ISHAN MISHRA www.IshanTech.org 1
  • 2. Outline  Robot applications  How it works  Cycle Avoidance 2
  • 3. Applications  Behavior of web robots  Wander from web site to site (recursively),  1. Fetching content,  2. Following hyperlinks,  3. Process the data they find.  Colorful names  Crawlers,  Spiders,  Worms,  Bots 3
  • 4. Where to Start: The “Root Set” A G L S B C D M N T U H I J O E F K P Q R 4
  • 5. Cycle Avoidance A B E B E B E AB A C A C A ABC C D D D (a) Robot fetches page A, (b) Robot follows link (c) Robot follows link and follows link, fetches B and fetches page C is back to A 5
  • 6. Loops  Cycles are bad for crawlers for there reasons.  Spending robot’s time and space  Overwhelm the web site.  Duplicate content. 6
  • 7. Data structure for robot  Trees and hash table  Lossy presence bit maps  Checkpoints  Save the list of visited URL to disk, in case the robot crashes  Partitioning  Robot farms 7
  • 8. Canonicalizing URLs  Most web robots try to eliminate the obvious aliases by “canonicalizing” URL into a standard form, by:  adding “:80” to the hostname, if the port isn’t specified.  Converting all %xx escaped characters into their character equivalents.  Removing # tags 8
  • 9. Symbolic link cycles / / index.html subdir index.html subdir index.html logo.gif (a) subdir is a directory (b) subdir is an upward symbolic link 9
  • 10. Dynamic Virtual Web Spaces  It can be possible to publish a URL that looks like a normal file but really is a gateway application.  This application can generate HTML on the fly that contains links to imaginary URLs on the same server. When these imaginary URLs are requested, new imaginary URLs are generated.  Such kind of malicious web server take the poor robot on an Alice-in-Wonderland journey through an infinite virtual space, even if the web server doesn’t really contain any files. Sometimes the robot is hard to detect this trap, because HTML and URLs may look very different all the time.  For example, a CGI-based calendaring program 10
  • 11. Malicious dynamic web space example 11
  • 12. Techniques for avoiding loops  Canonicalizing URLs  Breath-first crawling  Throttling  Limit the number of pages the robot can fetch from a web site in a period of time.  Limit URL size  Avoid symbolic cycle problem.  Problem: many sites use URLs to maintain user state.  URL/site blacklist  vs. “excluding Robot” 12
  • 13. Techniques for avoiding loops  Pattern detection  e.g., “subdir/subdir/subdir…”  e.g., “subdir/images/subdir/images/subdir/…”  Content fingerprinting  A checksum concept, while the odds of two different pages having the same check sum are small.  Message digest functions such as MD5 are popular for this purpose.  Human monitoring  Should design your robot with diagnostics and logging, so human beings can easily monitor the robot’s process and be warned quickly if something unusual is happening. 13
• 14. Robotic HTTP  Robots are no different from any other HTTP client program.  Many robots implement only the minimum amount of HTTP needed to request the content they seek.  It is recommended that robot implementers send some basic header information to notify the site of the robot’s capabilities, the robot’s identity, and where it originated. 14
• 15. Identifying Request Headers  User-Agent  Tells the server the robot’s name  From  Gives the email address of the robot’s user/administrator  Accept  Tells the server what media types are okay to send (e.g., only fetch text and sound)  Referer  Tells the server how the robot found links to this site’s content 15
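These headers can be attached to a request with Python's urllib; the robot name, contact address, and URL below are made-up placeholders, and the actual network call is left commented out:

```python
import urllib.request

# A polite robot request carrying the four identifying headers from the slide.
req = urllib.request.Request(
    "http://www.example.com/index.html",
    headers={
        "User-Agent": "ExampleBot/1.0",        # the robot's name
        "From": "robot-admin@example.com",     # contact for the robot's operator
        "Accept": "text/html, text/plain",     # only fetch textual content
        "Referer": "http://www.example.com/",  # page where the link was found
    },
)
# resp = urllib.request.urlopen(req)  # network call omitted in this sketch
print(req.get_header("User-agent"))   # urllib stores header names capitalized
```

A server operator seeing this traffic can identify the robot and contact its administrator if it misbehaves.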
• 16. Virtual docroots cause trouble if no Host header is sent  The robot tries to request index.html from www.csie.ncnu.edu.tw, but does not include a Host header.  The server is configured to serve both sites, but serves www.ncnu.edu.tw by default.  Request message: GET /index.html HTTP/1.0  User-agent: ShopBot 1.0  Response message: HTTP/1.0 200 OK […] <HTML> <TITLE>National Chi Nan University</TITLE> […]  The robot receives the wrong site’s content. 16
• 17. What else a robot should support  Virtual hosting  Not sending the Host header can lead to robots associating the wrong content with a particular URL  Conditional requests  Minimize the amount of content retrieved by using conditional HTTP requests (like cache revalidation)  Response handling  Status codes: 200 OK, 404 Not Found, 304 Not Modified  Entities: <meta http-equiv=“refresh” content=“1; URL=index.html”>  User-agent targeting  Webmasters should keep in mind that many robots will visit their sites. Many sites optimize content for particular user agents (e.g., IE or Netscape)  Problem: “your browser does not support frames.” 17
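A conditional request can be sketched the same way: the robot sends If-Modified-Since so the server can answer 304 Not Modified and skip re-sending an unchanged body. The URL and date below are illustrative, and the network call is commented out:

```python
import urllib.request
import urllib.error

req = urllib.request.Request(
    "http://www.example.com/index.html",
    headers={"If-Modified-Since": "Sat, 01 Jan 2022 00:00:00 GMT"},
)
# try:
#     resp = urllib.request.urlopen(req)   # 200: page changed, re-index it
# except urllib.error.HTTPError as e:
#     if e.code == 304:
#         pass                             # 304: cached copy is still fresh
print(req.get_header("If-modified-since"))
```

On large crawls this saves substantial bandwidth, since most pages do not change between visits.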
• 18. Misbehaving Robots  Runaway robots  Robots issue HTTP requests as fast as they can  Stale URLs  Robots revisit old lists of URLs, requesting pages that have since moved or disappeared  Long, wrong URLs  May reduce the web server’s performance, clutter the server’s access logs, or even crash the server  Nosy robots  Some robots may fetch URLs that point to private data and make that data easily accessible through search engines  Dynamic gateway access  Robots don’t always know what they are accessing 18
• 19. Excluding Robots  Before requesting content from www.ncnu.edu.tw, the robot fetches and parses the site’s robots.txt file and determines whether it is allowed to access the acetylene-torches.html file. It is, so it proceeds with the request. 19
• 20. robots.txt format  # allow googlebot and csiebot to crawl the public parts of our site, but no other robots may crawl anything  User-Agent: googlebot  User-Agent: csiebot  Disallow: /private  User-Agent: *  Disallow: / 20
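This policy can be checked with Python's standard-library robots.txt parser. In a real robot the file would be fetched from http://site/robots.txt; here the slide's rules are parsed from a string, with the final group barring all other robots from everything:

```python
from urllib import robotparser

rules = """\
User-Agent: googlebot
User-Agent: csiebot
Disallow: /private

User-Agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("googlebot", "/index.html"))  # True: only /private is off-limits
print(rp.can_fetch("googlebot", "/private/x"))   # False: explicitly disallowed
print(rp.can_fetch("OtherBot", "/index.html"))   # False: all other robots are barred
```

Note that consecutive User-Agent lines share the Disallow rules that follow them, exactly as in the slide's example.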
• 21. Robots Exclusion Standard versions  Version 0.0 (June 1994): “A Standard for Robot Exclusion,” Martijn Koster’s original robots.txt mechanism with the Disallow directive  Version 1.0 (Nov. 1996): “A Method for Web Robots Control,” Martijn Koster’s IETF draft with additional support for Allow  Version 2.0 (Nov. 1996): “An Extended Standard for Robot Exclusion,” Sean Conner’s extension including regex and timing information; not widely supported 21
• 22. Robots.txt path matching examples  Rule /tmp matches URL /tmp (rule path equals URL path)  Rule /tmp matches /tmpfile.html and /tmp/a.html (rule path is a prefix of the URL path)  Rule /tmp/ does not match /tmp (/tmp/ is not a prefix of /tmp)  An empty rule path matches everything (e.g., README.TXT)  Rule /~fred/hi.html matches /%7Efred/hi.html and vice versa (%7E is treated the same as ~)  Rule /%7efred/hi.html matches /%7Efred/hi.html (case isn’t significant in escapes)  Rule /~fred/hi.html does not match /~fred%2Fhi.html (%2F is a slash, but an escaped slash is a special case that must match exactly) 22
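The matching rule these examples illustrate can be sketched in a few lines: percent-decode both paths before the prefix comparison, but leave %2F encoded, since an escaped slash must match literally. This is an illustrative sketch of the rule, not a full robots.txt implementation:

```python
from urllib.parse import unquote

def decode_except_slash(path):
    """Decode %xx escapes, but keep encoded slashes (%2F) as-is."""
    protected = path.replace("%2F", "%252F").replace("%2f", "%252F")
    return unquote(protected)

def rule_matches(rule_path, url_path):
    """True if the robots.txt rule path is a prefix of the URL path."""
    return decode_except_slash(url_path).startswith(decode_except_slash(rule_path))

print(rule_matches("/tmp", "/tmpfile.html"))               # True: prefix match
print(rule_matches("/tmp/", "/tmp"))                       # False: /tmp/ not a prefix of /tmp
print(rule_matches("/%7Efred/hi.html", "/~fred/hi.html"))  # True: %7E decodes to ~
print(rule_matches("/~fred/hi.html", "/~fred%2Fhi.html"))  # False: %2F stays escaped
```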
  • 23. HTML Robot-control Meta Tags  e.g.  <META NAME=“ROBOTS” CONTENT=directive-list>  Directive-list  NOINDEX  Not to process this document content  NOFOLLOW  Not to crawl any outgoing links from this page  INDEX  FOLLOW  NOARCHIVE  Should not cache a local copy of the page  ALL (equivalent to INDEX, FOLLOW)  NONE (equivalent to NOINDEX, NOFOLLOW) 23
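A robot can pull these directives out of a page with the standard-library HTML parser; the page below is a made-up example:

```python
from html.parser import HTMLParser

class RobotMetaParser(HTMLParser):
    """Collect the directives from <META NAME="ROBOTS" CONTENT="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "meta" and a.get("name", "").lower() == "robots":
            self.directives = [d.strip().upper()
                               for d in a.get("content", "").split(",")]

p = RobotMetaParser()
p.feed('<html><head><meta name="robots" content="noindex, nofollow"></head></html>')
print(p.directives)  # ['NOINDEX', 'NOFOLLOW']
```

A well-behaved robot would then skip indexing this page (NOINDEX) and ignore its outgoing links (NOFOLLOW).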
• 24. Additional META tag directives  DESCRIPTION (content: text)  Allows an author to define a short text summary of the web page. Many search engines look at META DESCRIPTION tags, allowing page authors to specify short abstracts describing their web pages.  <meta name=“description” content=“Welcome to Mary’s Antiques web site”>  KEYWORDS (content: comma list)  Associates a comma-separated list of words that describe the web page, to assist in keyword searches.  <meta name=“keywords” content=“antiques,mary,furniture,restoration”>  REVISIT-AFTER (content: no. days)  Instructs the robot or search engine that the page should be revisited, presumably because it is subject to change, after the specified number of days. This directive is not likely to have wide support.  <meta name=“revisit-after” content=“10 days”> 24
  • 25. Guidelines for web robot operators (Robot Etiquette) 25
• 30. Modern Search Engine Architecture  Two decoupled halves:  Query serving: web search users submit queries through a web search gateway to the query engine, which answers them from a full-text index database  Crawling and indexing: the search engine’s crawler/indexer fetches pages from web servers and builds that full-text index database 30
• 32. Posting the Query  The user fills out an HTML search form (with a GET action HTTP method) in the browser and hits Submit, e.g., the query “drills”.  Request message to www.csie.ncnu.edu.tw: GET /search.html?query=drills HTTP/1.1  Host: www.csie.ncnu.edu.tw  Accept: *  User-agent: ShopBot  Response message from the search gateway: HTTP/1.1 200 OK  Content-type: text/html  Content-length: 1037  <HTML> <HEAD><TITLE>Search Results</TITLE> […] 32
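Because the form uses a GET action, the query ends up percent-encoded in the request URL itself. Building that URL can be sketched with urllib.parse (the hostname and path are the slide's examples):

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Build the search URL exactly as the browser does for a GET form submission.
url = "http://www.csie.ncnu.edu.tw/search.html?" + urlencode({"query": "drills"})
print(url)  # http://www.csie.ncnu.edu.tw/search.html?query=drills

# The gateway on the server side decodes the same query string:
print(parse_qs(urlsplit(url).query))  # {'query': ['drills']}
```

urlencode also handles escaping, so a query like "power drills" becomes query=power+drills without any hand-rolled encoding.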
• 33. Reference (HW#4)  Paper reading: “Searching the Web”  Paper reading: “Hyperlink analysis for the Web,” IEEE Internet Computing, 2001  http://www.searchtools.com Search Tools for Web Sites and Intranets: resources for search tools and robots  http://www.robotstxt.org/wc/robots.html The Web Robots Pages: resources for robot developers, including the registry of Internet Robots  http://www.searchengineworld.com Search Engine World: resources for search engines and robots  http://search.cpan.org/dist/libwww-perl/lib/WWW/RobotRules.pm RobotRules Perl source  http://www.conman.org/people/spc/robots2.html An Extended Standard for Robot Exclusion  Witten, I., Moffat, A., and Bell, T., Managing Gigabytes: Compressing and Indexing Documents and Images, Morgan Kaufmann 33