Open Archives Initiatives For Metadata Harvesting
A Framework for Building Open Digital Libraries
Term Paper-1
Submitted by
NIKESH.N
International School of Information Management
University of Mysore
2010
1.0 Introduction
A Digital Library may be defined as a system that supports the collection, organization, storage, retrieval
and dissemination of digital documents. It may be viewed as the intersection of Library Science,
Computer Science and networked information systems. Open movements are gaining acceptance
in the scholarly information arena, and many universities and research centers have started
to provide public access to their repositories. With the growing number of digital
repositories on the Web, it has become difficult for users to visit individual sites in search of
information. Many organizational repositories have not been indexed by search engines. A
mechanism is therefore required by which repositories can share resources and work in
coordination, to provide a broader purview to users. The ability of
information systems to work in coordination is termed Interoperability. The Open Archives
Initiative is one of the landmark efforts to make the metadata of digital resources
held in many repositories available at the users' end.
The essence of the open archives approach is to enable access to Web-accessible material through
interoperable repositories for metadata sharing, publishing and archiving.
Such interoperability requirements necessitated the development of standards such as the Dublin
Core Metadata Element Set and the Open Archives Initiative's Protocol for Metadata Harvesting
(OAI-PMH). These standards have achieved a degree of success in the DL community largely
because of their generality and simplicity.
2.0 Need for a Harvester protocol
There is a growing need to make resources, not only descriptive metadata, harvestable in an
interoperable manner. There are two major use cases that motivate this need:
• Preservation: The need to periodically transfer digital content from a data repository to one or
more trusted digital repositories charged with storing and preserving safety copies of the
content. The trusted digital repositories need a mechanism to automatically synchronize with
the originating data repository.
• Discovery: The need to use content itself in the creation of services. Examples include search
engines that make full-text from multiple data repositories searchable, and citation indexing
systems that extract references from the full-text content. Another scenario is the provision of
thumbnail versions of high-quality images from cultural heritage collections to external
services that build browsing interfaces that include the thumbnails.
3.0 OAI Protocol for Metadata Harvesting (OAI-PMH)
In October of 1999 the Open Archives Initiative (OAI) was launched in an attempt to address
interoperability issues among the many existing and independent DLs. The focus was on high-
level communication among systems and simplicity of protocols. The OAI has since received
much media attention in the DL community and, primarily because of the simplicity of its
standards, has attracted many early adopters. It defines a mechanism for harvesting records
containing metadata from repositories.
3.1 Definitions of Key terms
• Open Archives Initiative (OAI)
OAI is an initiative to develop and promote interoperability standards that aim to facilitate the
efficient dissemination of content.
• Archive
The term "archive" in the name Open Archives Initiative reflects the origins of the OAI in
the e-prints community where the term archive is generally accepted as a synonym for
repository of scholarly papers. Members of the archiving profession have justifiably noted
the strict definition of an "archive" within their domain; with connotations of preservation of
long-term value, statutory authorization and institutional policy. The OAI uses the term
"archive" in a broader sense: as a repository for stored information. Language and terms are
never unambiguous and uncontroversial and the OAI respectfully requests the indulgence of
the professional archiving community with this broader use of "archive".
(OAI definition quoted from FAQ on OAI Web site)
• OAI Protocol for Metadata Harvesting (OAI-PMH)
OAI-PMH is a lightweight harvesting protocol for sharing metadata between services.
• Protocol
A protocol is a set of rules defining communication between systems. FTP (File Transfer
Protocol) and HTTP (Hypertext Transfer Protocol) are examples of other protocols used for
communication between systems across the Internet.
• Harvesting
In the OAI context, harvesting refers specifically to the gathering together of metadata from a
number of distributed repositories into a combined data store.
3.2 Prerequisites to develop metadata harvesting protocol
To facilitate metadata harvesting there needs to be agreement on:
o Transport protocol - HTTP or FTP or other such protocol
o Metadata format - Dublin Core or MARC or other such format
o Metadata Quality Assurance - mandatory element set, naming and subject conventions, etc.
o Intellectual Property and Usage Rights - who can do what with what?
3.3 OAI: Key players
There are two groups of 'participants': Data Providers and Service Providers.
Data Providers
(open archives, repositories) provide free access to metadata, and may, but do not necessarily,
offer free access to full texts or other resources. OAI-PMH provides an easy to implement, low
barrier solution for Data Providers.
Service Providers
use the OAI interfaces of the Data Providers to harvest and store metadata. Note that this means
that there are no live search requests to the Data Providers; rather, services are based on the
harvested data via OAI-PMH. Service Providers may select certain subsets from Data Providers
(e.g., by set hierarchy or date stamp). Service Providers offer (value-added) services on the basis
of the metadata harvested, and they may enrich the harvested metadata in order to do so.
3.4 How it works
The OAI-PMH gives a simple technical option for data providers to make their metadata
available to services, based on the open standards HTTP (Hypertext Transfer Protocol) and
XML (Extensible Markup Language). The metadata that is harvested may be in any format that
is agreed by a community (or by any discrete set of data and service providers), although
unqualified Dublin Core is specified to provide a basic level of interoperability. Thus, metadata
from many sources can be gathered together in one database, and services can be provided based
on this centrally harvested or "aggregated" data. The link between this metadata and the related
content is not defined by the OAI protocol. It is important to realize that OAI-PMH does not
provide a search across this data; it simply makes it possible to bring the data together in one
place. In order to provide services, the harvesting approach must be combined with other
mechanisms.
3.5 Protocol details
Records
A record is the metadata of a resource in a specific format. A record has three parts: a header and
metadata, both of which are mandatory, and an optional about statement. Each of these is made
up of various components as set out below.
o header (mandatory)
    – identifier (mandatory: 1 only)
    – datestamp (mandatory: 1 only)
    – setSpec elements (optional: 0, 1 or more)
    – status attribute for deleted item
o metadata (mandatory)
    – XML encoded metadata with root tag, namespace
    – repositories must support Dublin Core, may support other formats
o about (optional)
    – rights statements
    – provenance statements
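The record layout above can be illustrated with a short parsing sketch. This is only a minimal example using Python's standard library; the sample record, its identifier `oai:example.org:123` and the helper name `parse_record` are made up for illustration and do not come from any particular repository.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical record: header (identifier, datestamp, setSpec) plus metadata.
SAMPLE_RECORD = """<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:example.org:123</identifier>
    <datestamp>2010-01-15</datestamp>
    <setSpec>theses</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
      <dc:title>A sample thesis</dc:title>
      <dc:creator>Doe, J.</dc:creator>
    </oai_dc:dc>
  </metadata>
</record>"""


def parse_record(xml_text):
    """Return the header fields and Dublin Core titles of one record."""
    record = ET.fromstring(xml_text)
    header = record.find("oai:header", NS)
    metadata = record.find("oai:metadata", NS)
    titles = []
    if metadata is not None:  # deleted records carry no metadata part
        titles = [t.text for t in metadata.iter("{%s}title" % NS["dc"])]
    return {
        "identifier": header.findtext("oai:identifier", namespaces=NS),
        "datestamp": header.findtext("oai:datestamp", namespaces=NS),
        "sets": [s.text for s in header.findall("oai:setSpec", NS)],
        "deleted": header.get("status") == "deleted",
        "titles": titles,
    }


if __name__ == "__main__":
    print(parse_record(SAMPLE_RECORD))
```

The optional about part is ignored here; a harvester that needs rights or provenance statements would read it in the same way as the metadata part.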
Datestamps
A datestamp is the date of last modification of a metadata record. Datestamp is a mandatory
characteristic of every item. It has two possible levels of granularity:
YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ.
The function of the datestamp is to provide information on metadata that enables selective
harvesting using from and until arguments. Its applications are in incremental update
mechanisms. It gives either the date of creation, last modification, or deletion. Deletion is
covered with three support levels: no, persistent, transient.
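As a rough illustration of how datestamps support incremental updates, the sketch below checks which of the two granularities a datestamp uses and builds a selective ListRecords request with from and until arguments. The base URL `http://repository.example.org/oai` and the function names are assumptions made for this example only.

```python
import re
from urllib.parse import urlencode

# The two datestamp granularities named in the text.
DAY = re.compile(r"\d{4}-\d{2}-\d{2}")
SECONDS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z")


def granularity(datestamp):
    """Return the granularity pattern a datestamp follows, or raise ValueError."""
    if DAY.fullmatch(datestamp):
        return "YYYY-MM-DD"
    if SECONDS.fullmatch(datestamp):
        return "YYYY-MM-DDThh:mm:ssZ"
    raise ValueError("not an OAI-PMH datestamp: %r" % datestamp)


def incremental_request(base_url, last_harvest, until=None):
    """Build a ListRecords URL for records changed since last_harvest."""
    from_granularity = granularity(last_harvest)
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc",
              "from": last_harvest}
    if until is not None:
        # Both bounds should use the same granularity.
        if granularity(until) != from_granularity:
            raise ValueError("from and until use different granularities")
        params["until"] = until
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical repository address, used only for illustration.
    print(incremental_request("http://repository.example.org/oai", "2010-01-01"))
```

A harvester would typically store the date of its last successful run and pass it as `last_harvest` on the next run, so that only new, modified or deleted records are transferred.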
Metadata schema
OAI-PMH supports dissemination of multiple metadata formats from a repository. The
properties of metadata formats are:
– id string to specify the format (metadataPrefix)
– metadata schema URL (XML schema to test validity)
– XML namespace URI (global identifier for metadata format)
Repositories must be able to disseminate unqualified Dublin Core. Further arbitrary metadata
formats can be defined and transported via the OAI-PMH. Any returned metadata must comply
with an XML namespace specification. The Dublin Core Metadata Element Set contains 15
elements. All elements are optional, and all elements may be repeated.
3.6 The Dublin Core Metadata Element Set:
Title        Contributor  Source
Creator      Date         Language
Subject      Type         Relation
Description  Format       Coverage
Publisher    Identifier   Rights
Sets
Sets enable a logical partitioning of repositories. They are optional: archives do not have to
define Sets. There are no recommendations for the implementation of Sets. Sets are not
necessarily exhaustive of the content of a repository. They are not necessarily strictly
hierarchical. It is important and necessary to have negotiated agreements within communities
defining useful sets for the communities.
• function: selective harvesting (set parameter)
• applications: subject gateways, dissertation search engine, and others
• examples
o publication types (thesis, article, …)
o document types (text, audio, image, …)
o content sets, according to DNB (medicine, biology, …)
3.7 Request format
Requests must be submitted using the GET or POST methods of HTTP, and repositories must
support both methods. At least one key=value pair, verb=RequestType (where RequestType is
some type of request such as ListRecords), must be provided. Additional key=value pairs depend
on the request type.
Example of a GET request: http://archive.org/oai?verb=ListRecords&metadataPrefix=oai_dc
The encoding of special characters must be supported; for example, ":" (host port separator)
becomes "%3A".
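The request format described above can be sketched in a few lines of Python. The helper `build_request` is a hypothetical name chosen for this example; the encoding itself is done by the standard library function `urllib.parse.urlencode`.

```python
from urllib.parse import quote, urlencode


def build_request(base_url, verb, **arguments):
    """Build an OAI-PMH GET URL from a verb and further key=value pairs."""
    params = {"verb": verb}      # the verb pair is always required
    params.update(arguments)     # additional pairs depend on the verb
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # The GET example from the text.
    print(build_request("http://archive.org/oai", "ListRecords",
                        metadataPrefix="oai_dc"))
    # Special characters are percent-encoded, e.g. ":" becomes "%3A".
    print(quote(":", safe=""))
    # Hypothetical identifier containing ":" characters.
    print(build_request("http://archive.org/oai", "GetRecord",
                        identifier="oai:example.org:123",
                        metadataPrefix="oai_dc"))
```

For a POST request the same encoded key=value string would be sent as the request body instead of being appended to the URL.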
3.8 Response
Responses are formatted as HTTP responses. The content type must be text/xml. HTTP-based
status codes, as distinguished from OAI-PMH errors, such as 302 (redirect) and 503 (service not
available), may be returned. Compression encodings are optional in OAI-PMH; only identity
encoding is mandatory. The response format must be well-formed XML with markup as follows:
1. XML declaration
   (<?xml version="1.0" encoding="UTF-8" ?>)
2. root element named OAI-PMH with three attributes
   (xmlns, xmlns:xsi, xsi:schemaLocation)
3. three child elements
   1. responseDate (UTC datetime)
   2. request (the request that generated this response)
   3. either
      a) error (in case of an error or exception condition), or
      b) an element with the name of the OAI-PMH request
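The response layout above can be checked with a small parsing sketch. The two sample responses and the function name `parse_response` are invented for illustration; a real harvester would also handle the HTTP status codes mentioned earlier.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical successful response to an Identify request.
SAMPLE_OK = """<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2010-02-08T12:00:00Z</responseDate>
  <request verb="Identify">http://repository.example.org/oai</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
  </Identify>
</OAI-PMH>"""

# Hypothetical error response to a request with an unknown verb.
SAMPLE_ERROR = """<?xml version="1.0" encoding="UTF-8" ?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-02-08T12:00:00Z</responseDate>
  <request>http://repository.example.org/oai</request>
  <error code="badVerb">Illegal OAI verb</error>
</OAI-PMH>"""


def parse_response(xml_text):
    """Split a response into responseDate, request verb, errors and payload."""
    if isinstance(xml_text, str):
        xml_text = xml_text.encode("utf-8")
    root = ET.fromstring(xml_text)
    if root.tag != OAI + "OAI-PMH":
        raise ValueError("root element is not OAI-PMH")
    request = root.find(OAI + "request")
    errors = [(e.get("code"), e.text) for e in root.findall(OAI + "error")]
    fixed = (OAI + "responseDate", OAI + "request", OAI + "error")
    payload = [child.tag[len(OAI):] for child in root if child.tag not in fixed]
    return {
        "responseDate": root.findtext(OAI + "responseDate"),
        "verb": request.get("verb") if request is not None else None,
        "errors": errors,
        "payload": payload[0] if payload else None,
    }


if __name__ == "__main__":
    print(parse_response(SAMPLE_OK))
    print(parse_response(SAMPLE_ERROR))
```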
3.9 OAI-PMH Verbs
Here 'verb' means the request type that the service provider/harvester sends to get responses from
data providers. There is a standard set of 6 verbs:
o Identify
o ListMetadataFormats
o ListSets
o GetRecord
o ListIdentifiers
o ListRecords
Verb                 Function
Identify             Description of the repository
ListMetadataFormats  Metadata formats supported by the repository
ListSets             Sets defined by the repository
ListIdentifiers      Retrieves the unique identifiers of items
ListRecords          Used to harvest records from the repository
GetRecord            Retrieves an individual metadata record from the repository
A harvester is not required to use all types. However, a repository must implement all types.
There are required and optional arguments, depending on request types.
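The relation between verbs and their arguments can be sketched as a lookup table. The table below follows the arguments defined by OAI-PMH 2.0, but deliberately leaves out the flow-control argument resumptionToken to keep the example short; the function name `check_request` is hypothetical.

```python
# Required and optional arguments per verb (resumptionToken omitted).
VERB_ARGUMENTS = {
    "Identify": {"required": set(), "optional": set()},
    "ListMetadataFormats": {"required": set(), "optional": {"identifier"}},
    "ListSets": {"required": set(), "optional": set()},
    "GetRecord": {"required": {"identifier", "metadataPrefix"},
                  "optional": set()},
    "ListIdentifiers": {"required": {"metadataPrefix"},
                        "optional": {"from", "until", "set"}},
    "ListRecords": {"required": {"metadataPrefix"},
                    "optional": {"from", "until", "set"}},
}


def check_request(verb, arguments):
    """Return None if the request is acceptable, else an OAI-PMH error code."""
    spec = VERB_ARGUMENTS.get(verb)
    if spec is None:
        return "badVerb"
    given = set(arguments)
    missing = spec["required"] - given
    illegal = given - spec["required"] - spec["optional"]
    if missing or illegal:
        return "badArgument"
    return None


if __name__ == "__main__":
    print(check_request("ListRecords", {"metadataPrefix": "oai_dc"}))
    print(check_request("GetRecord", {"identifier": "oai:example.org:123"}))
    print(check_request("GetRecords", {}))
```

A repository could run such a check before processing a request, while a harvester could use the same table to avoid sending requests that are bound to fail.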
4.0 DSpace: OAI compatible Digital Library Software
DSpace is open source software for building and managing digital repositories. Developed jointly by
MIT Libraries and Hewlett-Packard (HP), it is freely available to research institutions as an open
source system that can be customized and extended. DSpace is a digital institutional repository that
captures, stores, indexes, preserves, and redistributes content in digital formats. An Institutional
Repository is a set of services that a research institution, organization or university offers to the
members of its community for the management and dissemination of digital
materials created by the institution and its community members. Typically, DSpace has been
deployed for Institutional Repositories of publications, theses and dissertations. Several groups are
working on extending its capabilities, such as the implementation of ontologies in the search interface
and the submission module, customization for the management of electronic theses and dissertations,
and localization and internationalization of the package for the world's languages.
DSpace is compliant with OAI-PMH version 2.0, so metadata in DSpace digital libraries can be
harvested.
4.1 DSpace Search System
The end user can browse, search and access the collections using the hierarchies and also the
alphabetic bar menu. For searching the collection, DSpace uses the Lucene search engine, which is
part of the Apache Jakarta Project (1). Additionally, research projects such as the …(Portugal)…
provide ontologies that enable context-based querying. These work like subject-based directory
structures.
The Lucene search engine has powerful search features that cover many of the search approaches of
the end user. It provides the basic 'exact term' or keyword search. In addition, it allows fielded search,
akin to the field-level search of library databases. In DSpace, Dublin Core elements are used for the field
names. Lucene also facilitates Boolean search, range searches, term boosting and proximity searches.
An interesting Lucene search facility is fuzzy search, which is based on the Levenshtein algorithm
(5) and can match terms by similarity. This feature is especially useful in instances where
we hear a term and guess its spelling, and more so in the case of personal names.
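To make the idea of matching by similarity concrete, the following sketch computes the Levenshtein (edit) distance between two terms and uses it to pick near matches from a list of names. It is a textbook implementation written for illustration, not Lucene's own code; the threshold `max_distance` and the sample names are assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def fuzzy_matches(term, candidates, max_distance=2):
    """Return the candidates within max_distance edits of term."""
    return [c for c in candidates
            if levenshtein(term.lower(), c.lower()) <= max_distance]


if __name__ == "__main__":
    names = ["Ranganathan", "Ranganadhan", "Raghavan"]
    print(fuzzy_matches("Ranganathan", names))
```

With a small threshold, a misspelt personal name still finds the intended entry, while clearly different names are left out.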
4.2 Metadata in DSpace
DSpace users deal with or come across metadata in the following modules:
o Administration modules: Dublin Core registry, administrative metadata - default values, mail
alert to subscribers
o Submission modules: descriptive metadata
o Harvesting: OAI-PMH using the DC elements (unqualified)
o Search result display: brief and full metadata
4.3 Metadata harvesting in DSpace
DSpace is compliant with the OAI-PMH for exposing metadata. OAI-PMH allows repositories to
expose a hierarchy of sets in which records may be placed. DSpace exposes collections as sets.
Each collection has a corresponding OAI set, and harvesters use the verb (OAI command) ListSets to
discover the sets. Only the 15 basic Dublin Core elements are exposed at present.
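As an illustration of set discovery, the sketch below reads the sets out of a ListSets response. The sample response, including the setSpec values `collection_1` and `collection_2` and the collection names, is invented for this example and is not taken from an actual DSpace installation.

```python
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical ListSets response with one set per collection.
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2010-02-08T12:00:00Z</responseDate>
  <request verb="ListSets">http://repository.example.org/oai</request>
  <ListSets>
    <set>
      <setSpec>collection_1</setSpec>
      <setName>Theses and Dissertations</setName>
    </set>
    <set>
      <setSpec>collection_2</setSpec>
      <setName>Research Publications</setName>
    </set>
  </ListSets>
</OAI-PMH>"""


def list_sets(xml_text):
    """Map each setSpec in a ListSets response to its setName."""
    root = ET.fromstring(xml_text)
    return {
        s.findtext("oai:setSpec", namespaces=NS):
            s.findtext("oai:setName", namespaces=NS)
        for s in root.findall("oai:ListSets/oai:set", NS)
    }


if __name__ == "__main__":
    for spec, name in list_sets(SAMPLE_LIST_SETS).items():
        print(spec, "-", name)
```

A harvester can then pass one of the discovered setSpec values as the set argument of ListRecords to harvest a single collection.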
5.0 OAI Harvester Software
o Arc (http://arc.cs.odu.edu/)
o Citebase (http://citebase.eprints.org/cgi-bin/search)
o CYCLADES (http://www.ercim.org/cyclades/)
o DP9 (http://arc.cs.odu.edu:8080/dp9/index.jsp)
o MeIND (http://www.meind.de/)
o METALIS (http://metalis.cilea.it/)
o my.OAI (http://www.myoai.com)
o NCSTRL (http://www.ncstrl.org/)
o Perseus (http://www.perseus.tufts.edu/cgi-bin/vor)
o Public Knowledge Project – Open Archives Harvester (http://pkp.ubc.ca/harvester/)
o OAICAT (http://www.oclc.org/research/software/oai/cat.htm)
o OAI Repository Explorer (http://re.cs.uct.ac.za/)
o OAIster (http://oaister.umdl.umich.edu/o/oaister/)
o OASIC (Open Archives en SIC) (http://oasic.ccsd.cnrs.fr/)
o OAIHarvester (http://www.oclc.org/research/software/oai/harvester.htm)
o DLESE OAI Software (http://dlese.org/oai/index.jsp)
6.0 Future Prospects
Some more work has to be done to make OAI-PMH a complete, globally accepted
metadata harvesting protocol:
o Tools and software have to be developed by which non-OAI-PMH-compliant repositories
can be made OAI-PMH compliant, so that each such repository can act as a data provider.
o Higher versions of the protocol should be made compatible with the lower ones.
At the metadata creation level some standardization is required, as a particular resource is described
inconsistently at different repositories. Vocabulary control measures should also be taken.
Further improvements to OAI-PMH are still awaited; only then can we ensure
a comprehensive view of the resources available on a particular subject to our end users.
7.0 Conclusion
Much promise is seen for the use of the protocol within an open archives approach. Support for a
new pattern for scholarly communication is the most publicized potential benefit. Perhaps most
readily achievable are the goals of surfacing 'hidden resources' and low cost interoperability.
Although the OAI-PMH is technically very simple, building coherent services that meet user
requirements remains complex. The OAI-PMH protocol could become part of the infrastructure
of the Web, as taken-for-granted as the HTTP protocol now is, if a combination of its relative
simplicity and proven success by early implementers in a service context leads to widespread
uptake by research organizations, publishers and archives.
REFERENCES
1. http://www.openarchives.org/
2. Breeding, M. (2002, April). The Emergence of the Open Archives Initiative: This Protocol
could become a key part of the digital library infrastructure. Information Today.
from http://www.findarticles.com/cf_0/m3336/4_19/85251474/p1/article.jhtml
3. Breeding, M. (2002). Understanding the Protocol for Metadata Harvesting of the Open
Archives Initiative. Computers in Libraries, 22(8).
4. Lagoze, C., & Sompel, H. V. d. (2001, January). The Open Archives Initiative Protocol for
Metadata Harvesting, from http://www.openarchives.org/OAI/openarchivesprotocol.htm
5. Lynch, C. A. (2001, August). Metadata Harvesting and the Open Archives Initiative. ARL
Bimonthly Report 217. from http://www.arl.org/newsltr/217/mhp.html
6. Shearer, K. (2002, March). The Open Archives Initiative: Developing an Interoperability
Framework for Scholarly Publishing. CARL/ABRC Background Series, No. 5. from
http://www.carl-abrc.ca/projects/scholarly/open_archives.PDF
7. Suleman, H., & Fox, E. A. (2001, December). A Framework for Building Open Digital
Libraries. D-Lib Magazine, 7(12). from
http://www.dlib.org/dlib/december01/suleman/12suleman.html
8. Sompel, H. V. d., & Lagoze, C. (2000, February). The Santa Fe Convention of the Open
Archives Initiative. D-Lib Magazine, 6(2). from
http://www.dlib.org/dlib/february00/vandesompel-oai/02vandesompel-oai.html
9. Warner, S. (2001, June). Exposing and Harvesting Metadata Using the OAI Metadata
Harvesting Protocol: A Tutorial. HEP Libraries Webzine Issue 4. from
http://library.cern.ch/HEPLW/4/papers/3/
11. http://www.ukoln.ac.uk/repositories/digirep/index/FAQs
12. Shepherd, M. (2003). Interoperability for Digital Libraries. DRTC Workshop on
Semantic Web, 8th – 10th December, 2003, DRTC, Bangalore.
13. http://www.openarchives.org/Register/BrowseSites
14. http://www.openarchives.org/service/listproviders.html