2. COLLECTION METHODS
WWW = interplay between Web Client and Web Server
A web server stores content, such as HTML pages or images, which it delivers (serves) to a web browser in response to that browser's request.
A web browser requests content from a web server and then presents the received content to the user.
3. Mechanism of Interaction
A protocol defines the standard format for communication between the server and the browser.
Example:
The most commonly used protocol on the web is HTTP (Hypertext Transfer Protocol). When a browser sends a request to a web server, that request takes the form of an HTTP message, and the server returns its reply in the same format.
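As a rough illustration, the following Python sketch performs the same exchange a browser does: it sends an HTTP GET request and reads the server's HTTP response. The host name and path are placeholders only.

# Minimal sketch of the browser/server exchange using Python's standard
# http.client module; "example.com" and "/index.html" are placeholders.
import http.client

conn = http.client.HTTPConnection("example.com", 80)
conn.request("GET", "/index.html")           # the browser-style HTTP request
response = conn.getresponse()                # the server's HTTP response

print(response.status, response.reason)      # e.g. 200 OK
print(response.getheader("Content-Type"))    # format of the returned content
body = response.read()                       # the HTML page itself
conn.close()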
4. URL (Uniform Resource Locator)
All content on a web server is identified using a Uniform Resource Locator (URL): a reference which describes where the content is located on the web.
5. Fundamental Categories of Collection
There are two types of collection techniques:
1- Content driven collection methods
2- Event driven collection methods
6. Content Driven Collection Methods
These seek to archive the underlying content of the website.
Event Driven Collection Methods
These collect the actual transactions that occur.
Further distinctions can be made based on the source from which the content is collected. Content can be archived from the:
1- Web Server (Server Side Collection)
2- Web Browser (Client Side Collection)
8. Static Websites
A static website consists of a series of pre-existing web pages, each of which is linked to from at least one other page. Each web page is typically composed of one or more individual elements.
The structure is contained within the HTML documents, which contain hyperlinks to other elements, such as images and other pages.
All elements of the website can be stored in a hierarchical folder structure on the web server, and the URL describes the location of each element within that structure.
9. Form of URL
The target of a hyperlink is normally specified in the “HREF”
attribute of an HTML element and defines the URL of the target
resource.
The form of the URL may be absolute or relative, as illustrated by the following examples.
10. Absolute and Relative
An absolute URL specifies a fully qualified domain name and path:
<A href="http://www.mysite.com/products/new.html">New Products</A>
A relative URL specifies only the path relative to the source object:
<A href="new.html">New Products</A>
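As a brief illustration, a browser (or crawler) resolves a relative link against the URL of the page that contains it. The following Python sketch uses the standard urllib.parse module; the URL of the containing page is an assumed example.

# Sketch of resolving a relative link against the page that contains it.
# The base URL below is a hypothetical page on www.mysite.com.
from urllib.parse import urljoin

base = "http://www.mysite.com/products/index.html"   # page containing the link
print(urljoin(base, "new.html"))
# -> http://www.mysite.com/products/new.html (equivalent to the absolute form)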
11. Dynamic Websites
In a dynamic website the pages are generated from smaller elements of content. When a request is received, the required elements are assembled into a web page and delivered. Types of dynamic content are:
- Databases
- Syndicated Content
- Scripts
- Personalization
12. Databases
The content used to create web pages is often stored in a database,
such as a Content Management System, and dynamically assembled
into web pages.
Scripts
Scripts may be used to generate dynamic content, responding differently depending on the values of certain variables, such as the date, the type of browser making the request, or the identity of the user.
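As a rough sketch (with a Python dictionary standing in for a real database or content management system), a script of this kind might assemble a page on each request:

# Minimal sketch of dynamic page assembly: content is held in a data store
# (here just a dictionary standing in for a database) and turned into HTML
# only when a request for a given page id arrives.
import datetime

CONTENT_DB = {
    "home": "Welcome to our site.",
    "products": "Our current product list.",
}

def render_page(page_id, user_agent):
    # Look up the stored content element; fall back if the page id is unknown.
    body = CONTENT_DB.get(page_id, "Page not found.")
    # The script can vary its output by date, browser or user identity.
    today = datetime.date.today().isoformat()
    return (f"<html><body><p>{body}</p>"
            f"<p>Generated {today} for {user_agent}</p></body></html>")

print(render_page("products", "ExampleBrowser/1.0"))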
13. Syndicated Content
A website may include content which is drawn from external resources, such as pop-ups or RSS feeds, and then dynamically inserted into the web pages.
Personalization
Many websites make increasing use of personalization, to deliver
content which is customized to an individual user.
Example: cookies may be used to store information on a user's computer, which is returned by their browser whenever it makes a request to that website.
14. Depending on the nature of a dynamic website, these virtual pages may be linked to from other pages or may only be available through searching. Websites may contain both static and dynamic elements.
Example:
The home page and other pages that change only infrequently may be static, whereas pages that are updated on a regular basis, such as a product catalogue, may be dynamic.
15. The Matrix of Collection Methods
The range of possible methods for collecting web content is dictated by these
considerations. Four alternative collection methods are currently available.
Table 4.1 The Matrix of Collection Methods

               Content Driven         Event Driven
Client Side    Remote Harvesting      No method available
Server Side    Direct Transfer,       Transactional Archiving
               Database Archiving
16. Direct Transfer
The simplest method of collecting web resources is to acquire a
copy of the data directly from the original source. This approach, which requires direct access to the host web server, and therefore the co-operation of the website owner, involves copying the selected resources from the web server and transferring them to the collecting institution, either on removable media such as CD, or online using email or FTP.
17. Direct Transfer
Direct transfer is best suited to static websites which comprise only HTML documents and other objects stored in a hierarchical folder structure on the web server. The whole website, or part of it, can be acquired simply by copying the relevant files and folders to the collecting institution's storage system (a minimal sketch follows below).
The copied website will function in precisely the same way as the original, with two limitations:
- The hyperlinks must be relative, not absolute.
- Any functionality in the original website will no longer be operable unless the appropriate search engine is installed in the new environment.
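As a minimal, hypothetical sketch of the copying step (the source and destination paths are placeholders), the transferred folder tree can simply be replicated into the institution's storage, preserving the original hierarchy:

# Hypothetical direct-transfer sketch: copy the website's folder tree from a
# transfer medium into archival storage, preserving the original hierarchy.
import shutil
from pathlib import Path

source = Path("/mnt/transfer/www.mysite.com")         # copy supplied by the site owner
destination = Path("/archive/websites/mysite-2006")   # collecting institution's storage

shutil.copytree(source, destination)                  # replicate files and folders as-is
print("Copied", sum(1 for _ in destination.rglob("*")), "items")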
18. Strengths
The principal advantage of the direct transfer method is that it
potentially offers the most authentic rendition of the collected
website. By collecting from source, it is possible to ensure that
the complete content is captured with its original structure. In
effect, the collecting institution re-hosts a complete copy of the original website. The degree of authenticity which it is possible to recreate will depend upon the complexity of the technical dependencies, and the extent to which the collecting institution is capable of reproducing them.
19. Limitations
The major limitations of this approach are:
- The resources required to effect each transfer, and the sustainability of the supporting technologies.
- This method requires cooperation on the part of the website
owner, to provide both the data and the necessary
documentation.
20. Go through the Case Study: Bristol Royal Infirmary Inquiry (see page 48 of the book).
21. Database Archiving
The increasing use of web databases has made the development of new web archiving tools a priority, and such tools are now beginning to appear. The process of archiving database-driven sites involves three stages:
1- The repository defines a standard data model and format for archived databases.
2- Each source database is converted to that standard format.
3- A standard access interface is provided to the archived
databases.
22. Database Format
The obvious technology to use for defining an archival database format is XML, which is an open standard specifically designed for representing data structures. Several tools are available which convert proprietary databases to XML format.
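As an illustrative sketch of the principle (not SIARD or DeepArc themselves), the structure and content of a relational table can be exported to XML with Python's standard library; the database, table and columns below are invented examples:

# Illustrative sketch of exporting a relational table to XML. A tiny
# in-memory SQLite database stands in for the real source database.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES (1, 'Widget', 9.99)")

root = ET.Element("table", name="products")
for row in conn.execute("SELECT id, name, price FROM products"):
    record = ET.SubElement(root, "row")
    for column, value in zip(["id", "name", "price"], row):
        ET.SubElement(record, column).text = str(value)   # one element per column

ET.ElementTree(root).write("products.xml", encoding="utf-8", xml_declaration=True)
conn.close()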
23. Tools for Conversion in XML
- SIARD (Swiss Federal Archives)
- DeepArc (Bibliothèque nationale de France)
Both of these tools allow the structure and content of a relational
database to be exported into standard formats.
24. SIARD
The workflow of SIARD is:
1- It automatically analyses and maps the structure of the source database.
2- It exports the definition of the database structure as a text file containing the data definition described using SQL.
3- The content is exported as plain text files, together with any large binary objects stored in the database, and the metadata is exported as an XML document.
4- The data can then be loaded into any relational database management system to provide access.
25. DeepArc
- It enables a user to map the relational model of the original database to an XML schema and then export the content of the database into an XML document.
- It is intended to be used by the database owner since its use in
any particular case requires detailed knowledge of the
underlying structure of the database being archived.
26. Flow of Work of DeepArc Tool
• First, the user creates a view of the database, called a skeleton, which is defined using XML.
• The skeleton describes the desired structure of the XML documents that will be generated from the database.
• The user then builds the associations to map the database to this view.
27. • This entails mapping both the database structure (i.e. the tables) and the contents (i.e. the columns within those tables). Once these associations have been created and configured, the user can then export the content of the database into an XML document which conforms to the defined schema.
• If the collecting institution defines a standard XML data
model for its archived database, it can therefore use a tool such
as DeepArc to transform each database to that structure.
28. Strengths
It offers a generic approach to collecting and preserving database content which avoids the problems of supporting multiple technologies incurred by the alternative approach of direct transfer. This limits issues of preservation and access to a single format, against which all resources can be brought to bear. For example, archives can use standard access interfaces such as that provided by the XINQ tool.
29. Limitations
• Web database archiving tools are a recent development and are therefore still technologically immature compared to some other collection methods.
• Support for the underlying technologies is currently limited.
• The nature and timing of collection are constrained.
• The original 'look and feel' is not preserved (the method collects database content rather than the website as it was presented).
• It requires the active cooperation and participation of the website owner.
30. Remote Harvesting Technique
Remote Harvesting is the most common and most
widely employed method for collecting websites. It
involves the use of web crawler software to
harvest content from remote web servers.
'Crawlers' are software programs designed to interact with online services in the same way as a human user, principally to gather the required content. Most search engines use such crawlers to collect and index web pages.
31. Web Crawler
A web crawler shares many similarities with a desktop web browser: it submits HTTP requests to a web server and stores the content that it receives in return. The actions of the web crawler are dictated by a list of URLs (or 'seeds') to visit. The crawler visits the first URL on the list, collects the web page, identifies all the hyperlinks within the page, and adds them to the seed list.
In this way, a web crawler that begins on the home page of a website will eventually visit every linked page within that website. This is a recursive process and is normally controlled by certain parameters, such as the number of hyperlinks that should be followed.
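The following Python sketch illustrates this crawl loop under simple assumptions: a single seed URL (a placeholder), a fixed page limit in place of fuller crawl parameters, and the standard library only.

# Minimal crawl loop: start from a seed URL, fetch each page, extract
# hyperlinks and add new ones to the queue, up to a fixed page limit.
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    # Collects the href targets of <a> elements found in a page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    queue, seen, archive = [seed], set(), {}
    while queue and len(archive) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", "replace")
        except Exception:
            continue                           # skip pages that cannot be fetched
        archive[url] = html                    # store the collected page
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            queue.append(urljoin(url, link))   # resolve and queue newly found links
    return archive

pages = crawl("http://www.mysite.com/")        # placeholder seed URL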
32. Infrastructure
The infrastructure required to operate a web crawler can be minimal; the software simply needs to be installed on a computer system with an available internet connection and sufficient storage space for the collected data. However, in most large-scale archiving programmes, the crawler software is deployed on networked servers with attached disk or tape storage.
33. Types of Web Crawlers
There is a wide variety of web crawler software available, both proprietary and open source. The three most widely used web crawlers are:
1- HTTrack
2- NEDLIB Harvester
3- Heritrix
We have already discussed these web crawlers in the first lecture, so they will not be covered again here.
34. Parameters
Web crawlers provide a number of parameters that can be set to specify their exact behavior. Many crawlers are highly configurable, offering a very wide variety of settings. Most crawlers provide variations on the following parameters:
- Connection
- Crawl
- Collection
- Storage
- Scheduling settings
35. Connection Settings
These settings relate to the manner in which the crawler connects to web servers:
- Transfer Rate
- Connections
- Transfer Rate: the maximum rate at which the crawler will attempt to transfer data. A specific transfer rate is set so that data is captured quickly enough for an entire site to be collected in a reasonable timescale.
- Connections: specifies the number of simultaneous connections the web crawler may make with a host, or the delay between establishing connections.
36. Crawl Settings
These settings allow the user to control the behavior of the crawler as it traverses a website, such as the direction and depth of the crawl:
- Link depth and Limits
- Robot Exclusion Notices
- Link Discovery
Settings will normally be available to control the size and
duration of the crawl. For example, it may be desirable to halt a
crawl after it has collected a given volume of data, or within a
given timeframe.
37. • Link Depth and Limits: this determines the number of links that the crawler should follow away from its starting point, and the direction in which it should move. It is possible to limit the crawler in terms of whether it is restricted to following links within the same path, website or domain, and to what depth.
• Robot Exclusion Notice: a robot exclusion notice is a method used by websites to control the behavior of robots such as web crawlers. It uses a standard protocol to define which parts of a website are accessible to the robot. These rules are contained within a 'robots.txt' file in the top-level folder of the website (a brief sketch of checking robots.txt follows this list).
• Link Discovery: the user may also be able to configure how the crawler analyses hyperlinks: links may be dynamically constructed by scripts, or hidden within content such as Flash files, and therefore not transparent to the crawler. However, more sophisticated crawlers can be configured to discover many of these hidden links.
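As a small sketch of honouring a robots exclusion notice, Python's standard urllib.robotparser module can fetch and apply a site's robots.txt rules; the site, URL and crawler name below are placeholders:

# Sketch of checking a robots exclusion notice before crawling a URL.
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("http://www.mysite.com/robots.txt")   # placeholder site
robots.read()                      # fetch and parse the robots.txt rules

if robots.can_fetch("MyArchiveCrawler/1.0", "http://www.mysite.com/products/new.html"):
    print("Allowed to crawl this URL")
else:
    print("Excluded by robots.txt")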
38. Collection Settings
These settings allow the user to fine-tune the behavior of the crawler, and particularly to determine the content that is collected. Filters can be defined to include or exclude certain paths and file types.
For example: to exclude links to pop-up advertisements, or to collect only links to PDF files. Filters may also be used to avoid crawler traps, whereby the crawler becomes locked into an endless loop, by detecting repeating patterns of links. The user may also be able to place limits on the maximum size of files to be collected.
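A minimal sketch of such include/exclude filters is given below; the patterns and size limit are illustrative assumptions only:

# Illustrative collection filters: URL patterns and a maximum file size.
import re

EXCLUDE = [re.compile(r"/ads/"), re.compile(r"sessionid=")]   # e.g. pop-up ads, crawler traps
INCLUDE = [re.compile(r"\.pdf$"), re.compile(r"\.html?$")]    # e.g. collect only these types
MAX_SIZE = 10 * 1024 * 1024                                   # maximum file size: 10 MB

def should_collect(url, size):
    if size > MAX_SIZE:
        return False
    if any(pattern.search(url) for pattern in EXCLUDE):
        return False
    return any(pattern.search(url) for pattern in INCLUDE)

print(should_collect("http://www.mysite.com/catalogue/list.pdf", 2048))   # True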
39. Storage Settings
These settings determine how the crawler stores the collected content. By default, most crawlers will mirror the original structure of the website, building a directory structure which corresponds to the original hierarchy. However, it may be possible to specify other options, such as forcing all images to be stored in a single folder. These options are unlikely to be useful in most web archiving scenarios, where preservation of the original structure will be considered desirable. The crawler can also rewrite hyperlinks, for example to convert an absolute link into a relative link.
40. Scheduling Settings
Tools such as PANDAS, which provide workflow capabilities,
allow the scheduling of crawls to be controlled. Typical
parameters will include:
Frequency: daily or weekly.
Dates: the start or commencement date of the process.
Non-scheduled Dates: it may also be possible to define specific dates for crawling in addition to the standard schedule.
41. Identifying the Crawler
Software agents such as web browsers and crawlers identify themselves to the online services with which they connect through a 'user agent' identifier within the HTTP headers of the requests they send. Thus, Internet Explorer 6.0 identifies itself with the user agent Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1). The user agent string displayed by a web crawler can generally be modified by the user.
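As a short sketch, a crawler's user agent string can be set on each request; the agent name, contact address and URL below are invented examples:

# Sketch of setting a custom User-Agent header on a crawler's HTTP request.
import urllib.request

request = urllib.request.Request(
    "http://www.mysite.com/",
    headers={"User-Agent": "MyArchiveCrawler/1.0 (webarchive@example.org)"},
)
response = urllib.request.urlopen(request)
print(response.getheader("Content-Type"))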
42. Advantages of Identification
There are three advantages to this identification:
1- The crawler identifies the institution on whose behalf it is operating.
2- Web servers may be configured to block certain user agents, including web crawlers and search engine robots. Defining a more specific user agent can prevent such blocking, even when using a crawler that would otherwise be blocked.
3- Some websites are designed to display correctly only in certain browsers, and check the user agent in any HTTP request accordingly. User agents which do not indicate a compatible browser will then be redirected to a warning page.
43. Strengths
The greatest strengths of remote harvesting are:
- Ease of use
- Flexibility
- Widespread applicability
- Availability of a number of mature software tools
- A remote harvesting programme can be established very quickly and allows a large number of websites to be collected in a relatively short period.
- The infrastructure requirements are relatively simple and it requires no active participation from website owners: the process is entirely in the control of the archiving body.
- Most web crawler software is comparatively straightforward to use, and can be operated by non-technical staff with some training.
44. Limitations
- Careful configuration is required.
- Inability to collect dynamic content.
- The speed at which large volumes of data can be collected is limited, which is a drawback.
45. Transactional Archiving
Transactional archiving is a fundamentally different approach from any
of those previously described, being event driven rather than content
driven.
• Transactional archiving is an event-driven approach, which collects
the actual transactions which take place between a web server and a
web browser. It is primarily used as a means of preserving evidence
of the content which was actually viewed on a particular website, on
a given date. This may be particularly important for organizations
which need to comply with legal or regulatory requirements for
disclosing and retaining information.
• A transactional archiving system typically operates by intercepting
every HTTP request to, and response from, the web server, filtering
each response to eliminate duplicate content, and permanently
storing the responses as bitstreams. A transactional archiving system
requires the installation of software on the web server, and cannot
therefore be used to collect content from a remote website.
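As a hypothetical sketch of the idea (not a description of any particular product), transactional archiving can be pictured as server-side middleware that hashes each outgoing response and stores only responses it has not seen before; the class, names and demo application below are invented:

# Hypothetical sketch of transactional archiving as WSGI middleware installed
# on the web server: each response is hashed, and only previously unseen
# (unique) responses are stored alongside the requested path.
import hashlib

class TransactionalArchive:
    def __init__(self, app, store):
        self.app = app          # the original web application
        self.store = store      # mapping of content digest -> (path, body)

    def __call__(self, environ, start_response):
        chunks = []
        for chunk in self.app(environ, start_response):
            chunks.append(chunk)
            yield chunk                              # pass the response on unchanged
        body = b"".join(chunks)
        digest = hashlib.sha256(body).hexdigest()
        if digest not in self.store:                 # filter out duplicate content
            self.store[digest] = (environ.get("PATH_INFO", ""), body)

def demo_app(environ, start_response):
    # Stand-in application representing the real website.
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>Hello</body></html>"]

archive_store = {}
app = TransactionalArchive(demo_app, archive_store)   # wrap the site with the archiver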
46. Example of Transactional Archive
• pageVault supports the archiving of all unique responses
generated by a web server.
• It allows you to know exactly what information you have
published on your web site, whether static pages or
dynamically generated content, and regardless of format
(HTML, XML, PDF, zip, Microsoft Office formats, images,
sound), regardless of rate of change.
• Although every unique HTTP response can be archived and indexed, you can define non-material content (such as the current date/time and trivial site personalisation) on a per-URL, directory or regular-expression basis, which pageVault will exclude when calculating the novelty of a response.
47. Strengths
The great strength of transactional archiving is that it
collects what is actually viewed. It offers the best option
for collecting evidence of how a website was used, and
what content was actually available at any given
moment. It can be a good solution for archiving certain
kinds of dynamic website.
48. Limitations
- Transactional collection does not collect content which has never been viewed by a user.
- Because transactional collection takes place on the web server, it cannot capture variations in the user experience which are introduced by the web browser.
- Transactional archiving must take place server-side and therefore requires the active co-operation of the website owner.
- The time taken for the server to process and respond to each request will be longer.