Abstract: Dr. Robert Sanderson, a scientist at Los Alamos National Laboratory and visiting scholar at the Graduate School of Library & Information Science for 2009-2010, will present his work on Memento, a technical framework for adding a temporal dimension to web browsing to allow users to access web resources not only as they exist now but also as they existed in the past. After describing the overall system, the talk will focus on the metadata requirements for the TimeMap API that compliant web archives can expose in order to publish their holdings. This talk is sponsored by the GSLIS Metadata Round Table.
TimeMaps: Metadata for Memento
1. TimeMaps: Metadata for Memento
Herbert Van de Sompel
Robert Sanderson
Michael L. Nelson
Lyudmila Balakireva
Scott Ainsworth
Harihar Shankar
http://www.mementoweb.org/
Memento is partially funded by the
Library of Congress
TimeMaps: Metadata for Memento
GSLIS Metadata Group, UIUC, 14th July 2010
2. Memento wants to make Navigating the Web’s Past Easy
• Problem Statement
• Memento Solution
• Navigation not Search
• API for Web Archives
• Memento Ontology for TimeMaps
http://www.mementoweb.org/
http://groups.google.com/group/memento-dev
3. Web Resources have Different Representations over Time
5. Three Issues with Current Access to Archives
1. Access is via a new URI, unknown to the user.
2. People do not like to search for archived resources, and there is no automated method.
3. Navigation in the past is inconsistent:
a. Stuck in a single, necessarily incomplete archive
b. Or, if links are not rewritten, URIs lead back to the present
Comment on Popular Science article: http://bit.ly/bWr5gP
6. Issue 1: Representations Archived at a Different URI
Sep 11 2001, 20:36:10 UTC
http://web.archive.org/web/20010911203610/http://www.cnn.com/
archived resource for http://cnn.com
Dec 20 2001, 4:51:00 UTC
http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
archived resource for http://en.wikipedia.org/wiki/September_11_attacks
7. Issue 2: Searching is Cumbersome
http://web.archive.org/web/*/http://cnn.com/
http://en.wikipedia.org/w/index.php?title=September_11_attacks&action=history
8. Issue 3: Inconsistent Navigation (Archives Incomplete)
Sep 11 2001, 20:36:10 UTC
http://web.archive.org/web/20010911203610/http://www.cnn.com/
archived resource for http://cnn.com
Link followed: SPACE
Sep 11 2001, 21:38:55 UTC
http://web.archive.org/web/20010911213855/www.cnn.com/TECH/space/
9. Issue 3: Inconsistent Navigation (Can't Stay in Past)
Dec 20 2001, 4:51:00 UTC
http://en.wikipedia.org/w/index.php?title=September_11_attacks&oldid=282333
archived resource for http://en.wikipedia.org/wiki/September_11_attacks
Link followed: Pentagon
current
http://en.wikipedia.org/wiki/The_Pentagon
10. Past and Current Web are Not Integrated
11. The Web without a Time Dimension
Archived versions of a resource must be accessed via a different URI than its current version.
12. The Web with Time Dimension added by Memento
Memento uses the URI of the current version to access archived versions: qualify it
with a datetime, and arrive at the version that was current at that time.
13. The Memento Solution
There are two components to the Memento Solution:
• Component 1: Navigation to an archived resource via its original resource, by leveraging content negotiation.
• Component 2: A discovery API for archives that enables retrieving a list of all archived versions of a resource for a given URI.
14. Content Negotiation in Time
• Many systems support content negotiation for file format
o Your client by default asks for HTML and gets HTML
o But it could get PDF via the same URI
• Memento proposes a new dimension for content negotiation: Time
o Your client by default asks for the current time, and gets it
o But it could get an older version via the same URI
• Can be accomplished with only one new HTTP header in each direction:
o Accept-Datetime: request for a particular timestamp
o Content-Datetime: the returned content's timestamp
o These exactly mirror existing headers for Format, Language, etc.
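As a rough illustration of the request side, the following Python sketch formats a datetime as an HTTP-date for the Accept-Datetime header and parses a returned timestamp. The helper names are hypothetical, and the assumption that both headers carry HTTP-date values (as in the appendix example) is ours rather than something stated on this slide.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def accept_datetime_header(when):
    """Build the Accept-Datetime request header for an aware datetime.

    Hypothetical helper: the value is converted to UTC and rendered as an
    HTTP-date ending in "GMT".
    """
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


def parse_http_datetime(value):
    """Parse an HTTP-date such as a returned Content-Datetime value."""
    return parsedate_to_datetime(value)


# Ask for the state of a resource on the evening of 11 Sep 2001 (UTC).
request_headers = accept_datetime_header(
    datetime(2001, 9, 11, 20, 35, tzinfo=timezone.utc)
)
print(request_headers)
```

A client would send these headers with an ordinary request for the original URI and compare the parsed response timestamp with the one it asked for.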
15. [Diagram: www.cnn.com (current) alongside web.archive.org copies dated Apr 10 2001, 21:39:30 UTC; Aug 15 2004, 08:45:27 UTC; Aug 15 2007, 19:21:58 UTC]
16. [Same diagram, labeled: Original Resource (www.cnn.com), Mementos (web.archive.org)]
17. [Same diagram, with a "?" between Original Resource and Mementos]
18. [Same diagram, with a TimeGate between Original Resource and Mementos]
19. Conneg with TimeGate to Mementos
[Same diagram: Original Resource, TimeGate, Mementos]
20. Link Headers; Conneg with TimeGate to Mementos
[Same diagram: Original Resource, TimeGate, Mementos]
21. Link Headers; Conneg with TimeGate to Mementos
[Diagram: Original Resource, TimeGate, Mementos for wikipedia.org]
22. The Web with Time Dimension added by Memento
23. The Memento Solution
• Component 2: A discovery API for archives that allows requesting a list of all archived versions held for a resource with a given URI.
24. Why an API?
• Mementos for any given resource are distributed across archives.
(What? Not just the Internet Archive?!)
• To get a correct picture of the available Mementos, several archives need to be consulted.
• This can be done by distributed search (slow) or by consulting an aggregator.
• Aggregators and other services need a machine-readable description of archives' holdings to select the appropriate Memento for a request:
o Closest in time
o Most reliable representation
o Fastest responding
o (etc.)
29. WebCitation 13 May 2009 12:28:39
Archive-It 14 May 2009 01:18:11
BL Archive 14 May 2009 07:12:45
Dracos 14 May 2009 13:00:00
TNA 14 May 2009 18:21:32
And no Internet Archive…
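Given holdings like those listed on this slide, the "closest in time" selection an aggregator might perform can be sketched as follows. This is only an illustration: the slide does not state a time zone, so UTC is assumed, and the function and variable names are hypothetical.

```python
from datetime import datetime, timezone

# Archive name and capture time, taken from the slide (UTC is an assumption).
HOLDINGS = [
    ("WebCitation", datetime(2009, 5, 13, 12, 28, 39, tzinfo=timezone.utc)),
    ("Archive-It", datetime(2009, 5, 14, 1, 18, 11, tzinfo=timezone.utc)),
    ("BL Archive", datetime(2009, 5, 14, 7, 12, 45, tzinfo=timezone.utc)),
    ("Dracos", datetime(2009, 5, 14, 13, 0, 0, tzinfo=timezone.utc)),
    ("TNA", datetime(2009, 5, 14, 18, 21, 32, tzinfo=timezone.utc)),
]


def closest_memento(holdings, requested):
    """Return the (archive, datetime) pair nearest to the requested time."""
    return min(holdings, key=lambda entry: abs(entry[1] - requested))


requested = datetime(2009, 5, 14, 6, 0, 0, tzinfo=timezone.utc)
archive, captured = closest_memento(HOLDINGS, requested)
print(archive, captured.isoformat())
```

Other criteria from the earlier list (reliability, response speed) would need additional metadata per entry, which is the motivation for a richer TimeMap.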
30. TimeMaps
• At most basic: List of URIs of Mementos and their times
• Expressed as Linked Data; a profile of OAI ORE Resource Maps
• Link header from TimeGate and Memento
31. Basic ORE Model
Aggregation (Aggr) is a set of web resources (R-1 to R-3), described in RDF or
Atom by a Resource Map (ReM).
32. TimeBundles
Resources of Interest in Memento:
• Original Resource
• TimeGate
• Mementos
33. TimeGates
• Period(s) that the TimeGate covers
• Which resource is it a TimeGate for
• mem:TimeSpan, as a TimeGate can cover multiple distinct periods
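One way to picture this metadata is as a small record holding the TimeGate, the resource it serves, and a list of covered periods. The Python sketch below is a hypothetical in-memory model, not the Memento ontology itself; all class and field names, and the example URIs, are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Period:
    """One contiguous period of coverage (hypothetical stand-in for a time span)."""
    start: datetime
    end: datetime

    def contains(self, when):
        return self.start <= when <= self.end


@dataclass(frozen=True)
class TimeGateDescription:
    """What a TimeMap might say about a TimeGate (field names are assumptions)."""
    timegate_uri: str
    original_uri: str
    periods: tuple  # tuple of Period; there may be several distinct periods

    def covers(self, when):
        """True if any covered period includes the requested datetime."""
        return any(period.contains(when) for period in self.periods)


utc = timezone.utc
gate = TimeGateDescription(
    timegate_uri="http://archive.example/timegate/http://cnn.com/",
    original_uri="http://cnn.com/",
    periods=(
        Period(datetime(2001, 1, 1, tzinfo=utc), datetime(2002, 12, 31, tzinfo=utc)),
        Period(datetime(2006, 1, 1, tzinfo=utc), datetime(2007, 12, 31, tzinfo=utc)),
    ),
)
print(gate.covers(datetime(2001, 9, 11, tzinfo=utc)))
print(gate.covers(datetime(2004, 8, 15, tzinfo=utc)))
```

An aggregator could use such a check to skip TimeGates whose coverage has a gap around the requested datetime.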
34. Mementos
• Time Period: valid for or observed over, number of observations
• Metadata: size, format, etc (will come back to the "etc")
• Which resource it is a Memento for
35. Serializations
• RDF/XML
o Good for XML parsers
• Turtle, N3 and related
o Good for graph parsers
• RDFa
o Good for web browsers
• Atom
o Good for alerting, feed readers, etc. (but still embeds RDF)
• New: Link Header format
o Good for real-time applications
o Smaller file size (just the facts, ma'am)
o Easy to implement with existing link header parsers
o Servers need to produce this format anyway, so it offers a non-RDF way out
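To give a feel for how little is needed to consume the Link Header serialization, here is a minimal Python sketch that parses a link-format TimeMap into URIs and attributes. The sample document is invented: the rel values ("original", "timegate", "memento"), the datetime attribute, and the archive.example URIs are assumptions for illustration and may not match the exact vocabulary of the Memento drafts.

```python
import re
from email.utils import parsedate_to_datetime

# <uri> followed by zero or more ;name="value" parameters.
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)="([^"]*)"')

# Hypothetical TimeMap in link format (one link per line for readability).
SAMPLE = (
    '<http://cnn.com/>;rel="original",\n'
    '<http://archive.example/timegate/http://cnn.com/>;rel="timegate",\n'
    '<http://web.archive.org/web/20010911203610/http://www.cnn.com/>'
    ';rel="memento";datetime="Tue, 11 Sep 2001 20:36:10 GMT",\n'
    '<http://archive.example/20010410213930/http://cnn.com/>'
    ';rel="memento";datetime="Tue, 10 Apr 2001 21:39:30 GMT"'
)


def parse_link_timemap(text):
    """Return a list of (uri, params) pairs, where params is a dict."""
    return [
        (uri, dict(PARAM_RE.findall(raw_params)))
        for uri, raw_params in LINK_RE.findall(text)
    ]


def mementos(entries):
    """Return (uri, datetime) pairs for memento links, oldest first."""
    found = [
        (uri, parsedate_to_datetime(params["datetime"]))
        for uri, params in entries
        if "memento" in params.get("rel", "").split() and "datetime" in params
    ]
    return sorted(found, key=lambda pair: pair[1])


entries = parse_link_timemap(SAMPLE)
for uri, when in mementos(entries):
    print(when.isoformat(), uri)
```

The regular expressions only handle quoted parameter values, which keeps the sketch short; a production parser would follow the full Link header grammar.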
36. Use Case: Aggregator using TimeMaps
37. Link Headers; Conneg with TimeGate to Mementos
[Diagram: Original Resource, TimeGate, Mementos]
38. Metadata Discussion Points
1. What metadata is necessary to determine the most appropriate copy?
• Distance to requested time most important
• Quality of representation?
• Usage statistics for Original Resource? For Memento?
• User tagging of Memento for quality?
• Archive response speed?
• Need to know more information from user preferences?
2. What other metadata is useful and available?
• Crawling archives have limited information
• CMS systems have much more
• User tags, comments, annotations
• Semantic information about content, e.g. title, author, subject
• Distribution of changes over time
39. Metadata Discussion Points
3. What metadata is necessary for inter-archive synchronization?
• Deduplication information: digests, request headers
• "Significant Change" factors
• Crawler settings: respect no-cache, robots.txt etc
4. What metadata can be generated by other services?
• Open World Model: Anyone can say anything about anything
• Technical metadata easy (MIX for images, etc)
• Time Series Analysis interesting (techtales.org)
• Machine Learning based approaches?
40. Thank You
Rob Sanderson:
• azaroth42@gmail.com
• rsanderson@lanl.gov
This presentation:
• http://www.slideshare.net/azaroth42/xxx
Memento:
• http://www.mementoweb.org/
• http://groups.google.com/group/memento-dev
MementoFox:
• https://addons.mozilla.com/en-US/firefox/addon/100298
aka: http://bit.ly/memfox
Memento Enables Navigating the Past Web
41. Discussion Questions
1. What metadata is necessary to determine the most appropriate copy?
2. What other metadata is useful and available?
3. What metadata is necessary for inter-archive synchronization?
4. What metadata can be generated by other services?
42. Appendix: Memento HTTP Flow
Request: HEAD R, (Accept-Datetime)
Response: Link G
Request: GET G, Accept-Datetime
Response: 302 M, Vary, TCN, Link R,B,M
Request: GET M, (Accept-Datetime)
Response: 200, Content-Datetime, Link R,B,M
44. Memento HTTP Flow: URI-R
HEAD R, (Accept-Datetime)
HEAD http://cnn.com/ HTTP/1.1
Host: cnn.com
Accept-Datetime: Tue, 11 Sep 2001 20:35:00 GMT
Connection: close
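The three-step flow in this appendix (HEAD on the original resource R, GET on the TimeGate G, GET on the Memento M) might be sketched in Python as below. To keep the example runnable without network access, the HTTP layer is an injected `fetch` callable and a stand-in `fake_fetch` plays the servers. The rel="timegate" value, the archive.example TimeGate URI, and all function names are assumptions for illustration.

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime


def find_link(link_header, rel):
    """Return the first URI in a Link header value that carries the given rel."""
    for uri, params in re.findall(r'<([^>]*)>([^<]*)', link_header):
        match = re.search(r'rel="([^"]*)"', params)
        if match and rel in match.group(1).split():
            return uri
    return None


def resolve_memento(fetch, uri_r, when):
    """Follow R -> G -> M; fetch(method, uri, headers) returns (status, headers)."""
    accept = {"Accept-Datetime": format_datetime(when, usegmt=True)}
    _, headers = fetch("HEAD", uri_r, accept)            # step 1: ask R for its TimeGate
    uri_g = find_link(headers.get("Link", ""), "timegate")
    if uri_g is None:
        return None
    status, headers = fetch("GET", uri_g, accept)        # step 2: negotiate with G
    if status != 302 or "Location" not in headers:
        return None
    uri_m = headers["Location"]
    status, headers = fetch("GET", uri_m, accept)        # step 3: retrieve M
    if status != 200:
        return None
    return uri_m, headers.get("Content-Datetime")


def fake_fetch(method, uri, headers):
    """Stand-in for real servers, returning canned responses."""
    gate = "http://archive.example/timegate/http://cnn.com/"
    memento = "http://web.archive.org/web/20010911203610/http://www.cnn.com/"
    if uri == "http://cnn.com/":
        return 200, {"Link": '<%s>;rel="timegate"' % gate}
    if uri == gate:
        return 302, {"Location": memento, "Vary": "negotiate, accept-datetime"}
    if uri == memento:
        return 200, {"Content-Datetime": "Tue, 11 Sep 2001 20:36:10 GMT"}
    return 404, {}


result = resolve_memento(
    fake_fetch, "http://cnn.com/", datetime(2001, 9, 11, 20, 35, tzinfo=timezone.utc)
)
print(result)
```

Swapping `fake_fetch` for a thin wrapper around a real HTTP client would exercise the same logic against servers that implement the protocol.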