Over the past decade, as the scholarly community’s reliance on e-content has increased, so too has the development of preservation-related digital repositories. The need for descriptive, administrative, and structural metadata for each digital object in a preservation repository was clearly recognized by digital archivists and curators. However, in the early 2000s, most of the published specifications for preservation-related metadata were either implementation specific or broadly theoretical. In 2003, the Online Computer Library Center (OCLC) and Research Libraries Group (RLG) established an international working group called PREMIS (Preservation Metadata: Implementation Strategies) to develop a common core set of metadata elements for digital preservation. The first version of the PREMIS Data Dictionary for Preservation Metadata and its supporting XML schema were issued in 2005. Experience using its specifications in preservation repositories has led to several revisions, with version 2.0 completed in 2008. The Data Dictionary is now in version 2.2 (July 2012), and it is widely implemented in preservation repositories throughout the world in multiple domains.
2. Metadata for Preservation: A Digital
Object’s Best Friend
Introduction to Preservation Metadata
Rebecca Squire Guenther
Library of Congress, NDMSO and
Consultant, meetyourdata.com
rguenther52@gmail.com
NISO Webinar, Feb. 13, 2013
3. Digital preservation: imperative and challenge
More and more of scholarly and cultural record exists in digital
form; steps must be taken to secure its long-term future
Groups such as Digital Preservation Coalition, NDIIPP and National
Digital Stewardship Alliance have made significant progress in
raising awareness about digital preservation imperative
Gradual shift in focus from articulating problem to solving it …
• Not so much “Why is digital preservation important” anymore;
rather, “What must be done to achieve preservation objectives?”
Many practical challenges in implementing reliable, sustainable
digital preservation programs
One key challenge: preservation metadata
4. Metadata and preservation metadata
METADATA
“Structured information that describes, explains, locates,
or otherwise makes it easier to retrieve, use, or manage an
information resource”
PRESERVATION METADATA
“Metadata that supports and documents the digital
preservation process”
5. Preservation metadata includes:
Provenance:
• Who has had custody/ownership of the digital object?
Authenticity:
• Is the digital object what it purports to be?
Preservation Activity:
• What has been done to preserve it?
Technical Environment:
• What is needed to render and use it?
Rights Management:
• What IPR must be observed?
Makes digital objects self-documenting across time
[Figure: Preservation Metadata accompanying Content 10 years on, 50 years on, forever]
6. Basics of preservation metadata
Digital preservation concentrates on well-designed formal
systems based on digital library and trusted digital
repository concepts
Information about what needs to be preserved, and how, is
part of any preservation system
Since items aren’t on shelves, metadata is the only
mechanism for actually keeping or finding anything
3 concepts are important
• Metadata about preservation of digital objects
• Preservation of metadata itself to ensure that content
and metadata are preserved
• Use of metadata in a trusted digital repository
8. PREMIS Data Dictionary
May 2005: Data Dictionary for Preservation
Metadata: Final Report of the PREMIS Working
Group
• Version 2.0 (April 2008)
• Version 2.1 (January 2011)
• Version 2.2 (July 2012)
• Version 3.0 expected 2013
Includes:
• Data Dictionary
• Data model
• Conformance
• Context/assumptions
• Usage examples
• XML schema to support implementation
Data Dictionary:
• Core set of implementable, broadly applicable preservation
metadata semantic units, supported by guidelines and
recommendations for management and use
9. What does PREMIS cover?
Administrative metadata that supports the digital
preservation process
Provides information to help manage a resource
for preservation purposes
• Technical characteristics
• Information about actions on an object
• Relationships (structural and derivative)
• Structural: indicates how compound objects are put
together
• Derivative: results of common preservation actions
• Rights metadata associated with preservation
In OAIS terms:
• Metadata as part of SIP, AIP or DIP
• Fits into Preservation Description Information
(Reference, Context, Provenance, Fixity)
10. What PREMIS is and is not
What PREMIS is:
• Common data model for organizing/thinking about preservation
metadata
• A checklist for core metadata in a repository
• Guidance for local implementations
• Standard for exchanging information packages between repositories
What PREMIS is not:
• Out-of-the-box solution: need to instantiate as metadata elements in
repository system
• All needed metadata: excludes business rules, format-specific
technical metadata, descriptive metadata for access, non-core
preservation metadata
• Lifecycle management of objects outside repository
• Rights management: limited to permissions regarding actions taken
within repository
11. PREMIS Data Model
[Diagram: five entities: Intellectual Entities, Objects, Events, Agents, Rights Statements]
12. Intellectual Entities
Set of content that is considered a single intellectual unit
for purposes of management and description (e.g., a book, a
photograph, a map, a database)
May include other Intellectual Entities (e.g. a website that
includes a web page)
**Has one or more digital representations**
Previously not fully described in PREMIS DD, but will be in
scope in version 3.0
Examples:
• Rabbit Run by John Updike (a book)
• “Maggie at the beach” (a photograph)
• The Library of Congress Website (a website)
• The Library of Congress: American Memory Home page (a web page)
13. Objects
Discrete unit of information in digital form
**Objects are what repository actually preserves**
Three types of Object:
• FILE: named and ordered sequence of bytes that is known by
an operating system
• REPRESENTATION: set of files, including structural metadata,
that, taken together, constitute a complete rendering of an
Intellectual Entity
• BITSTREAM: data within a file with properties relevant for
preservation purposes (but needs additional structure or
reformatting to be stand-alone file)
Intellectual entity will become another level of object
Examples:
• chapter1.pdf (a file)
• chapter1.pdf + chapter2.pdf + chapter3.pdf (representation of
a book w/3 chapters)
• TIFF file containing header and 2 images (2 bitstreams
(images), each with own set of properties (semantic units):
e.g., identifiers, technical metadata, inhibitors, … )
14. Object Example: book in two versions
Intellectual Entity: Da Vinci Code by Dan Brown
• Representation 1: Page image version
  • File 1: page1.tiff
  • File 2: page2.tiff
  • File N: pageN.tiff
  • File N+1: METS.xml
• Representation 2: ebook version
  • File 1: book.lit
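The book example can be sketched in code. This is a minimal Python illustration of the Intellectual Entity / Representation / File hierarchy, not PREMIS XML and not an official binding: the class names and the page count of 3 (standing in for "N") are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    """A named, ordered sequence of bytes known to an operating system."""
    name: str


@dataclass
class Representation:
    """Set of files that together give a complete rendering of an entity."""
    label: str
    files: List[File] = field(default_factory=list)


@dataclass
class IntellectualEntity:
    """A single intellectual unit, e.g. a book."""
    title: str
    representations: List[Representation] = field(default_factory=list)


# Hypothetical page count, standing in for the "N" in the example.
n_pages = 3

page_images = Representation(
    label="Page image version",
    files=[File(f"page{i}.tiff") for i in range(1, n_pages + 1)]
    + [File("METS.xml")],  # structural metadata travels with the page images
)
ebook = Representation(label="ebook version", files=[File("book.lit")])

book = IntellectualEntity(
    title="Da Vinci Code by Dan Brown",
    representations=[page_images, ebook],
)

# One intellectual entity, two representations, each with its own files.
for rep in book.representations:
    print(rep.label, [f.name for f in rep.files])
```

The point of the sketch is that the repository preserves the files and representations, while the book itself is only the unit that groups them.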
15. Events
An action that involves or impacts at least one Object or
Agent associated with or known by the preservation repository
Helps document digital provenance. Can track history of Object
through the chain of Events that occur during the Object’s
lifecycle
Determining which Events should be recorded, and at what level
of granularity, is up to the repository
Examples:
• Validation Event: use JHOVE tool to verify that chapter1.pdf
is a valid PDF file
• Ingest Event: transform an OAIS SIP into an AIP
• Migration Event: create a new version of an Object in an
up-to-date format
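How a chain of Events documents an Object's history can be sketched as follows. This is an illustrative Python sketch only: the field names, identifiers, and timestamps are assumptions invented for the example, not the semantic units or values of any real repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    event_id: str
    event_type: str   # e.g. "ingest", "validation", "migration"
    date_time: str    # ISO 8601 timestamp (made-up values below)
    detail: str
    object_id: str    # the Object the event acted on
    agent_id: str     # the Agent that carried it out


def history(events: List[Event], object_id: str) -> List[Event]:
    """Return the chain of events for one object, oldest first."""
    related = [e for e in events if e.object_id == object_id]
    # ISO 8601 strings in the same format sort chronologically.
    return sorted(related, key=lambda e: e.date_time)


events = [
    Event("ev-2", "validation", "2013-02-13T10:05:00",
          "Verified as a valid PDF file", "chapter1.pdf", "JHOVE version 1.0"),
    Event("ev-1", "ingest", "2013-02-13T10:00:00",
          "SIP transformed into an AIP", "chapter1.pdf", "repository system"),
    Event("ev-3", "ingest", "2013-02-13T10:01:00",
          "SIP transformed into an AIP", "chapter2.pdf", "repository system"),
]

for e in history(events, "chapter1.pdf"):
    print(e.date_time, e.event_type, e.detail, "by", e.agent_id)
```

Which events to record, and how finely, remains a local decision; the sketch only shows that provenance falls out of filtering and ordering whatever was recorded.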
16. Agents
Person, organization, or software program/system associated
with an Event or a Right (permission statement)
Agents are associated only indirectly to Objects through
Events or Rights
Not defined in detail in PREMIS DD; not considered core
preservation metadata beyond identification
Examples:
• Martha Anderson (a person)
• Library of Congress (an organization)
• Dark Archive in the Sunshine State implementation (a system)
• JHOVE version 1.0 (a software program)
17. Rights Statements
An agreement with a rights holder that grants permission for
the repository to undertake an action(s) associated with an
Object(s) in the repository.
Not a full rights expression language; focuses exclusively on
permissions that take the form:
• Agent X grants Permission Y to the repository in regard to
Object Z.
Example:
Priscilla Caplan grants FCLA digital repository permission to
make three copies of metadata_fundamentals.pdf for
preservation purposes.
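The "Agent X grants Permission Y in regard to Object Z" form can be sketched as a small record plus a lookup. This is a hedged Python illustration: the class, the `is_permitted` helper, and the act label "replicate" are hypothetical names chosen for the example, not PREMIS vocabulary.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RightsStatement:
    agent: str                         # Agent X: who grants the permission
    act: str                           # Permission Y: the permitted action
    object_id: str                     # Object Z: what the permission covers
    restriction: Optional[str] = None  # free-text limit on the permission


def is_permitted(statements: List[RightsStatement], act: str, object_id: str) -> bool:
    """True if some statement grants this act on this object."""
    return any(s.act == act and s.object_id == object_id for s in statements)


statements = [
    RightsStatement(
        agent="Priscilla Caplan",
        act="replicate",  # hypothetical label for "make copies"
        object_id="metadata_fundamentals.pdf",
        restriction="three copies, for preservation purposes",
    ),
]

print(is_permitted(statements, "replicate", "metadata_fundamentals.pdf"))
print(is_permitted(statements, "migrate", "metadata_fundamentals.pdf"))
```

Keeping the statement this narrow reflects the scope described above: it records permissions for actions inside the repository, not a full rights expression.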
20. Semantic units pertaining to Rights
Rights Statement
• Rights Statement Identifier
• Rights Basis
• Copyright Information
• License Information
• Statute Information
• Other Rights Information
Rights Granted
• act
• restriction
• termOfGrant
• rightsGranted
Linking Object Identifier
Linking Agent Identifier
rightsExtension
21. Semantic units pertaining to Agents
Agent Identifier
Agent Name
Agent Type
Agent Note
Agent Extension
Linking Event Identifier
Linking Rights Identifier
22. The State of PREMIS
De facto standard for preservation metadata; in some
countries mandated for cultural heritage repositories
Was recognized by winning the Digital Preservation Award
(2005) and was shortlisted for DPC Decennial award for
outstanding contribution to digital preservation (2012)
PREMIS implementations are appearing in many
places, many contexts, many forms
Experimentation has led to changes in the data dictionary
and schema
PREMIS Implementation fairs: attempts to consolidate
implementation experiences, issues, best practices,
23. Key features of PREMIS
Developed through international consensus-making process
Mobilized community to address shared need
Shared solution to a shared need
Implementation neutral
• Makes no assumptions about technology
• Can be flexibly adapted for use across all sorts of
institutions, digital preservation contexts, repository systems
• Allows for extensibility
Supported by Maintenance Activity and Editorial
Committee, under auspices of US Library of Congress
PREMIS is sustained, maintained, and evolved
Extensive outreach to implementer community
Tutorials, guides, implementation fairs, PIG Forum
“Support system” in place for PREMIS implementers
24. PREMIS Maintenance Activity
Web site:
• Permanent Web presence, hosted by
Library of Congress
• Central destination for PREMIS-related
info, announcements, resources
• Home of the PREMIS Implementers’ Group (PIG)
discussion list
PREMIS Editorial Committee:
• Set directions/priorities for PREMIS development
• Coordinate future revisions of Data Dictionary and XML
schema
• Promote implementation
http://www.loc.gov/standards/premis/
25. Implementation resources
Tools:
• XML schema
• PREMIS-in-METS toolbox <http://pim.fcla.edu>
• Controlled vocabularies at http://id.loc.gov
• RDF/OWL ontology for use as Linked Data
Guidelines:
• PREMIS conformance statement
• PREMIS & METS guidelines
Community Working groups on special topics
Others:
• Understanding PREMIS (available in multiple languages)
• PIG Forum
• Implementation Registry
• Tools Registry
26. Some implementers …
DAITSS (Florida): a preservation repository for the use of the
libraries of the public universities of Florida.
Ex Libris Rosetta: a commercial digital preservation system
supporting acquisition, validation, ingest, storage, management,
preservation and dissemination of different types of digital objects
National Digital Newspaper Program
Archivematica: comprehensive open-source digital preservation
system
National Archives of Sweden, National Archives of Scotland
Carolina Digital Repository: repository for material in electronic
formats produced by members of the University of North Carolina
at Chapel Hill community.
British Library electronic journal archiving project
For more information see:
• http://www.loc.gov/premis/premis-registry.html
27. Impact
De facto international standard for preservation metadata
• Part of permanent infrastructure supporting digital
preservation
• ISO standardization being considered
Wide applicability means benefits from PREMIS extend to
entire digital preservation community
Ongoing work to revise/update Data Dictionary and create
new supporting resources
• PREMIS is a dynamic resource that continues to generate
new sources of value to implementer community
Stood the test of time:
• Seven years after initial release, is now indispensable part
of digital preservation implementations around the world
• Not surpassed or replaced by other standard or resource
28. URLs, etc.
PREMIS Maintenance Activity:
http://www.loc.gov/standards/premis/
PREMIS Data Dictionary for Preservation Metadata:
http://www.loc.gov/standards/premis/v2/premis-2-2.pdf
Understanding PREMIS:
http://www.loc.gov/standards/premis/understanding-premis.pdf
PREMIS Implementation Registry
http://www.loc.gov/standards/premis/premis-registry.php
PREMIS Implementers Group list
http://listserv.loc.gov/listarch/pig.html
40. Digital preservation is the
series of management
policies and activities
necessary to ensure the
enduring usability, authenticity,
discoverability and accessibility of
content over the very long term.
101. NISO Webinar:
Metadata for Preservation:
A Digital Object's Best Friend
Questions?
All questions will be posted with presenter answers on
the NISO website following the webinar:
http://www.niso.org/news/events/2013/webinars/preservation
NISO Webinar • February 13, 2013
Editor's Notes
PREMIS in METS toolbox consists of 3 modules to help implementers: describe (generate PREMIS metadata), convert (between PREMIS and METS), validate (ensure quality metadata).
Controlled vocabularies increase interoperability and consistency of metadata.
RDF/OWL ontology allows for interconnection among preservation repositories, facilitates querying the metadata, and incorporates preservation-specific controlled vocabularies.
The available guidelines (the conformance statement and the guidelines for using PREMIS in METS) result in quality, consistent metadata.
Community working groups on specific topics include the Ontology working group and the Environment working group (to amend the data model); they are open to the preservation community at large.
PREMIS Implementers group forum allows the preservation community to participate in PREMIS development and submit change requests to the EC.
Implementation registry assists new implementers in planning their preservation systems.
Tools registry gives implementers tools.
PREMIS has had a significant impact on digital preservation activities.
Its wide applicability has resulted in cost savings to institutions developing preservation repositories because they have a standard that can be used by the entire preservation community.
Ongoing work makes it a dynamic resource: it continues to generate new sources of value to the implementer community.
Who am I????
I am Mom to these 4 beautiful children …<click>
More pertinently, though, I am the Archive Service Product Manager. I have an MA in Library Science. I have been with JSTOR and Portico forever – I started at JSTOR in 1996. I now focus on preservation at Portico and JSTOR.<click – to standards>
Before we begin, I want to share my philosophy on standards. In my opinion, standards do two things really well …<click>
They provide a framework for thinking about a topic and making a plan. Enter the wilderness with a map.<click>
They are also quite valuable as interchange specifications between organizations, or even groups within a single organization.<click>
Fortunately for me, the PREMIS folks seem to agree. PREMIS is about a way to think about preservation metadata, about the elements and units you need to consider. In my talk today, you aren’t going to see any PREMIS XML. You are going to see quite a lot about …<click>
The Portico content model and an XML content wrapper that we call PMD or the Preservation Metadata file. It is a pretty direct reflection of our content model and we have at least one PMD file for every item we preserve. Many considerations went into the design of the Portico preservation metadata …<click>
Another is our definition of preservation, which is on the screen. We spent quite a while developing this definition and it really helps us focus when making preservation decisions.<click>
What is Digital Preservation? Digital preservation is the series of management policies and activities necessary to ensure the enduring usability, authenticity, discoverability and accessibility of content over the very long term. The key goals of digital preservation include:
• usability – the intellectual content of the item must remain usable via the delivery mechanism of current technology
• authenticity – the provenance of the content must be proven and the content an authentic replica of the original
• discoverability – the content must have logical bibliographic metadata so that the content can be found by end users through time
• accessibility – the content must be available for use to the appropriate community
Any number of other standards influenced us, including …<click>
DIDL is a content model. It is very flexible. We almost used it.<click>
Our first preservation metadata file was METS based. We migrated to our new format a couple of years ago.<click>
Of course …<click>
And, no doubt many others that aren’t on the tip of my tongue at the moment.<click>
When we redesigned our preservation metadata file a couple of years back, we also drew pretty extensively on our experience. You’ll see that reflected in some areas as we talk, for example how we deal with events in our metadata file.<click>
The PREMIS entities and semantic units can be found embedded in the Portico content model, our metadata elements, and also in a system of registries we implement. Registries are a way for us to track things.<click>
A word about identifiers. PREMIS requires unique identifiers on every entity and semantic unit. At Portico we firmly believe in this philosophy and you’ll see throughout the presentation many of the ways in which we use unique identifiers to link between elements of our content model.<click>
We currently preserve a number of disparate things. They have many similarities, but they also have not insignificant differences.<click>
One of our goals is to represent all these disparate content types in one content model and with one set of preservation metadata. We need to manage the archive and the preserved content uniformly. To put this another way, if I can’t manage these uniformly, my head may explode. So, one content model …<click>
So the Portico content model is pretty heavily informed by DIDL. Containers contain other containers. Our model is limited to six levels. We have content types, such as e-books, e-journals, and digitized newspapers.<click>
They contain one or more content sets. A content set is just a way for us to bag content together. For example … For the e-journal content type, our content set is the journal. For the e-book content type, our content set is the publisher. For the digitized newspaper content type, our content set is the collection.<click>
Content sets contain one or more Archival Units. These are the units of preservation. For example … For e-journals, the archival unit is the article. For e-books, the archival unit is the book. For digitized newspapers, the archival unit is the issue.<click>
Each archival unit may contain one or more content units. We’d use this technique if the publisher sent us an update to the full item.<click>
Content units contain one or more functional units. A functional unit is an intellectual entity within the item. For example, the page images of an article are a functional unit. Each figure graphic is a functional unit.<click>
Each functional unit can contain one or more storage units (which are essentially files). Say we receive a high res image, low res image, and thumbnail for a single figure graphic. That one figure graphic functional unit would contain three storage units.<click>
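The six-level nesting just described can be sketched as containers that only accept children from the next level down. This is a rough Python illustration, not the actual Portico implementation: the `Container` class, the `count` helper, and the example names are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List

# The six levels named in the talk, outermost first.
LEVELS = [
    "content type",
    "content set",
    "archival unit",
    "content unit",
    "functional unit",
    "storage unit",
]


@dataclass
class Container:
    level: str
    name: str
    children: List["Container"] = field(default_factory=list)

    def add(self, name: str) -> "Container":
        """Create a child exactly one level down and return it."""
        depth = LEVELS.index(self.level)
        if depth + 1 >= len(LEVELS):
            raise ValueError("storage units are files and hold no containers")
        child = Container(LEVELS[depth + 1], name)
        self.children.append(child)
        return child


def count(node: Container, level: str) -> int:
    """Count containers of the given level at or below this node."""
    own = 1 if node.level == level else 0
    return own + sum(count(child, level) for child in node.children)


# Hypothetical e-journal content, following the examples in the talk.
ejournals = Container("content type", "e-journals")
journal = ejournals.add("an example journal")        # content set
article = journal.add("an example article")          # archival unit
original = article.add("original submission")        # content unit
figure = original.add("figure 1 graphic")            # functional unit
for rendition in ("high res image", "low res image", "thumbnail"):
    figure.add(rendition)                            # storage units

print(count(ejournals, "storage unit"), "storage units under", ejournals.name)
```

Because every content type goes through the same six levels, the archive can be managed uniformly no matter what is being preserved.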
At any level of the Portico content model, we can apply these metadata, among which some of the PREMIS semantic units may be found.<click>
This entire mess of information … content model and metadata are recorded in the file we call PMD or Preservation metadata.<click>
It is a thing of beauty. And, I’m not even an XML geek!<click>
Our PMD files tightly match the content model. This is a snippet of the XML tree of a PMD file.<click>
Archival units …<click>
Contain content units …<click>
Which contain functional units …
Which contain storage units. The higher elements of our content model are encoded in the construction of the archive itself and as metadata attributes and elements within the PMD file.<click>
Per PREMIS, objects often have the following information associated with them. This type of information is pretty deeply embedded in the Portico PMD.<click>
For example, here is a snippet of information about a storage unit (or file).<click>
Among other things, we have an ID for this storage unit.<click>
And a preservation level.<click>
Deeper into the storage unit, we have additional information, including:<click>
The size of the file …<click>
A basic format for the file …<click>
A format status for the file…<click>
I am going to touch very briefly on registries. At Portico, we use registries as a way to consolidate information. In this case, information about formats.<click>
Here we have two files (this is an element within the storage unit element).<click>
These files each have a very specific format name.<click>
That name provides us with significant additional information, found in our format registry, including a description, the authority and maintenance agencies, the default file extension and more. While we do track PREMIS information on our objects, it is found in a number of different places, from embedded in the content model or PMD files to registries.<click>
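The registry idea can be sketched as a lookup keyed by format name: the file records only the name, and everything else is held once in the registry. This is a minimal Python illustration with invented entries; the format names, authorities, and the `describe` helper are hypothetical and do not reflect the real Portico format registry.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class FormatEntry:
    description: str
    authority: str
    default_extension: str


# Hypothetical registry content, keyed by a very specific format name.
FORMAT_REGISTRY: Dict[str, FormatEntry] = {
    "example-pdf-1.4": FormatEntry(
        "Portable Document Format, version 1.4", "example authority", ".pdf"),
    "example-tiff-6.0": FormatEntry(
        "Tagged Image File Format, revision 6.0", "example authority", ".tiff"),
}

# Two storage units that carry only an ID and a format name.
storage_units = [
    {"id": "su-1", "format_name": "example-pdf-1.4"},
    {"id": "su-2", "format_name": "example-tiff-6.0"},
]


def describe(unit: dict) -> FormatEntry:
    """Resolve a storage unit's format name against the registry."""
    return FORMAT_REGISTRY[unit["format_name"]]


for unit in storage_units:
    entry = describe(unit)
    print(unit["id"], entry.description, entry.default_extension)
```

Consolidating the details this way means a correction to a format description is made in one place rather than in every metadata file.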
Per PREMIS, the important semantic units for Events are on the screen. Nothing too surprising.<click>
That key information for events can be found in three different locations within the Portico preservation metadata file.<click>
This is a processing record. Processing records are relatively new features for us. When Portico first started, we designed a very flexible system that would allow us to run different elements of our workflow on different machines. As we ramped up, it became clear that our administrative costs would be lower if we limited the number of machines we managed and that we could get much greater throughput running on a single, powerful machine. Originally, we had put all the information about the machine and tools into each event record. But, with experience under our belt, it became clear that we could streamline our metadata files by consolidating this information into Processing Records.<click>
Here is a close-up on a processing record and a set of events that reference this processing record.<click>
They are tied together through that unique processing record ID. And this relationship is telling the world that the events within this event set all occurred on the ConPrepLite system in July 2010.<click>
Within our PMD file, events are grouped into Event Sets. These are just a set of events that happened at the same time, for the same conceptual purpose, and are associated with a single processing record. Some of the events we track are above. Nothing too exciting.<click>
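The link between event sets and processing records can be sketched as a reference by ID, so machine and tool details are stored once rather than in every event. This is an illustrative Python sketch: the class names, the identifier "pr-1", and the event names are assumptions; only the ConPrepLite system and the July 2010 date come from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ProcessingRecord:
    record_id: str
    system: str   # where the work ran
    date: str     # when it ran (year-month)


@dataclass
class EventSet:
    processing_record_id: str   # the unique ID that ties the two together
    events: List[str] = field(default_factory=list)


records: Dict[str, ProcessingRecord] = {
    "pr-1": ProcessingRecord("pr-1", "ConPrepLite", "2010-07"),
}

# Hypothetical event names; the real list was shown on the slide.
event_set = EventSet("pr-1", ["example event A", "example event B"])


def where_and_when(es: EventSet, recs: Dict[str, ProcessingRecord]) -> Tuple[str, str]:
    """Resolve the system and date shared by every event in the set."""
    rec = recs[es.processing_record_id]
    return rec.system, rec.date


system, date = where_and_when(event_set, records)
for name in event_set.events:
    print(name, "ran on", system, "in", date)
```

Each event stays small, and the shared context is recovered by following the processing record ID.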
Another change we made was to unify the format of our events. Within Portico all events now contain only elements from the above list of possible elements. These were informed by PREMIS and you’ll see a number of similarities.<click>
If you are going to walk away remembering one thing, remember that events (like descriptive and technical metadata) can live on any element within the content model.<click>
These are the semantic units for agents. In general, rights holders are primary agents within a repository.<click>
In addition, however, are repository systems and people that might make changes to the content.<click>
For example, within these three processing records are three agents that touched the content.<click>
Per PREMIS, the important information to consider is on the screen.<click>
<chuckle><click>
At the moment and for Portico our rights statements are relatively straightforward. All of the Portico agreements, at the moment, are similar and thus we do not have a need to track a variety of different clauses and commitments. We have set up a system whereby content will not enter the Portico archive until such time as we have a formal agreement in place and that agreement has been preserved in the archive.<click>
As with many other PREMIS entities, rights entities are embedded within our PMD file.<click>
Every archival unit must reference a specific agreement. That agreement has a unique ID and can be found in the archive.<click>
Questions?<Amy: stay on afterward.><don’t click unless you need to address 2CUL or Xref questions>