These slides explain the Memento Framework (time travel for the Web) from the perspective of resource versioning. They also detail the progress made in deploying the framework since it was first introduced in November 2009, including standardization, tool development, and advocacy. In addition, they touch upon new challenges (discovery, branding) and announce plans to make transactional Web archiving software available in the course of 2011.
Memento: Big Leaps Towards Seamless Navigation of the Web of the Past
1. Memento
http://mementoweb.org/
Herbert Van de Sompel
Robert Sanderson
Michael L. Nelson
Big Leaps Towards Seamless Navigation
of the Web of the Past
Memento Update
CNI Task Force Meeting, Spring 2011 1
2. Overview of Memento Framework
Deployment Progress
Memento and Data
Memento and Discovery
Memento and Branding
Alternative Web Archiving Strategies
3. Overview of Memento Framework
Progress
Memento and Data
Memento and Discovery
Memento and Branding
Alternative Web Archiving Strategies
4. Memento wants to make it easy to access the Web of the Past.
5. [Figure: Tate Online today; after selecting the date March 16 2008, Tate Online as of March 16 2008, served from the National Archives]
6. Memento achieves this by introducing a uniform version access capability to integrate the present and past Web.
7. Content Management Systems:
• Designed to be aware of all versions of a resource;
• Self-contained;
• Variety of proprietary version mechanisms;
• Versions interlinked using proprietary mechanisms.
8. World Wide Web:
• Designed to forget about prior versions of a resource;
• Distributed.
9. There are resource versions on the Web:
• Content Management Systems;
• Web Archives;
• Transactional archives;
• Search engine caches.
10. But the Web architecture has a hard time dealing with them:
• Cannot talk about a resource as it used to exist;
• Cannot access a prior version knowing the current one;
• Cannot access the current version knowing a prior one.
Current approaches are ad hoc and localized.
11. Memento:
• Regards the Web as a big Content Management System;
• Introduces a uniform capability to access versions on the Web;
• Does not build new archives but leverages all systems that host versions: Web archives, Content Management Systems, Software Version Systems, etc.
12. Memento’s version access approach:
• Is distributed: versions may exist on several servers;
• Uses time as a global version indicator;
• Is based on the primitives of the Web: resource, resource state, representation, content negotiation, link.
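To make "content negotiation with time as the version indicator" concrete, here is a minimal Python sketch of the headers a client might send when asking for a past version. The Accept-Datetime header name is taken from the Memento Internet Draft; the function name and the example date are illustrative assumptions, and no network request is made.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_headers(when: datetime) -> dict:
    """Build request headers asking for the version closest to `when`."""
    # HTTP dates are expressed in GMT, so normalize to UTC first.
    when_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}


# Hypothetical example: ask for a resource as it existed on March 16 2008.
headers = accept_datetime_headers(datetime(2008, 3, 16, 12, 0, tzinfo=timezone.utc))
print(headers["Accept-Datetime"])
```

A client would attach these headers to an ordinary HTTP GET; the rest of the exchange relies on regular HTTP redirects and links.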
13. Original Resource and Versions
14. Bridge from Present to Past
15. Bridge from Past to Present
16. Memento Framework
17. Multiple Archives
19. Overview of Memento Framework
Deployment Progress
Memento and Data
Memento and Discovery
Memento and Branding
Alternative Web Archiving Strategies
20. Significant progress has been made towards seamless navigation of the Web of the Past.
21. Standardization:
• Standardization process started via the IETF;
• Interest from IETF and W3C;
• Encouraged by major Web architects, including Tim Berners-Lee, Mark Nottingham, and Michael Hausenblas.
https://datatracker.ietf.org/doc/draft-vandesompel-memento/
22. Memento Clients:
• Several client tools developed by us and others;
• Add-ons for Firefox (operational) and Internet Explorer (experimental);
• Applications for Android (operational) and iPhone/iPad (in development);
• Paper in next issue of Code4Lib Journal.
http://www.mementoweb.org/tools/
23. Memento server support (1):
• Memento-compliant Wayback software:
• Used by Internet Archive.
• Available to Web archives, worldwide.
• Please have your favorite Web Archive install this new version 1.6!
http://www.mementoweb.org/tools/
24. Memento server support (2):
• Plug-in for MediaWiki (operational);
• Used on W3C’s main wiki.
• Please install it for your MediaWiki!
http://www.mementoweb.org/tools/
25. Memento Server Validator
• Server-side client:
• Attempts to perform all Memento actions against a given URI
• Reports success/failure of the interactions and warnings for optional aspects
• Kept up to date with IETF Internet Draft
http://www.mementoweb.org/tools/
26. Memento Proxy Support
• Several systems that host Mementos made Memento-compliant “by proxy”:
• All major Web Archives that do not yet run Memento-compliant Wayback software
• 3,000+ MediaWiki systems, including Wikipedia
• We want all of these to become natively Memento compliant!
27. Memento Website:
• Ongoing effort to add materials that support understanding and adoption:
• Introduction to Memento
• How to recognize Mementos, TimeGates, Original Resources?
• Guidelines for servers that host Mementos (Web Archives, CMS, snapshot archives, etc.)
http://www.mementoweb.org/guide/
28. Funding:
• 2007-2010: US $250K grant from Library of Congress;
• Approx. 50K on Memento.
• 2010-2011: US $1 Million follow-up grant from Library of Congress.
• For: Specification, outreach, tool development, further research.
29. Overview of Memento Framework
Deployment Progress
Memento and Data
Memento and Discovery
Memento and Branding
Alternative Web Archiving Strategies
30. Memento Time Travel is really powerful.
Time-Series Data via HTTP follow-your-nose.
31. Memento Framework
32. Memento Framework & Time Series
Original Resource: http://dbpedia.org/resource/France
33. Time Travel across DBpedia Versions
Data collected through HTTP Navigation
paper at http://arxiv.org/abs/1003.3661
34. Overview of Memento Framework
Deployment Progress
Memento and Data
Memento and Discovery
Memento and Branding
Alternative Web Archiving Strategies
35. Very few Web sites provide a “timegate” link.
Need additional mechanisms to support Discovery.
36. Batch discovery of Mementos: TimeMaps
A TimeMap minimally lists:
• URI and datetime of Mementos known to an archive
• URI of Original Resource
TimeMaps can be aggregated across systems that host Mementos
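As an illustration of what a TimeMap carries, the following Python sketch parses a small TimeMap serialized as a list of Web links and extracts the Original Resource URI plus the (datetime, URI) pairs of the Mementos. The sample document, the example.org and archive.example.net URIs, and the function name are assumptions made for this sketch; real TimeMaps may carry additional links and attributes.

```python
import re
from email.utils import parsedate_to_datetime

# One link looks like: <URI>;rel="...";datetime="..."
LINK_RE = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*"[^"]*")*)')
PARAM_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def parse_timemap(text):
    """Return (original_uri, [(datetime, memento_uri), ...]) sorted by datetime."""
    original = None
    mementos = []
    for uri, raw_params in LINK_RE.findall(text):
        params = dict(PARAM_RE.findall(raw_params))
        rels = params.get("rel", "").split()
        if "original" in rels:
            original = uri
        if "memento" in rels:
            mementos.append((parsedate_to_datetime(params["datetime"]), uri))
    return original, sorted(mementos)


# Hypothetical TimeMap with two Mementos listed out of order.
SAMPLE = '''<http://example.org/page>;rel="original",
<http://archive.example.net/2008/http://example.org/page>;rel="memento";datetime="Sun, 16 Mar 2008 00:00:00 GMT",
<http://archive.example.net/2007/http://example.org/page>;rel="memento";datetime="Thu, 31 May 2007 20:35:00 GMT"'''

original, mementos = parse_timemap(SAMPLE)
for when, uri in mementos:
    print(when.isoformat(), uri)
```

Because each entry is just a URI with a datetime, aggregating TimeMaps from several archives amounts to merging and re-sorting these lists per Original Resource.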
37. Batch discovery of Mementos: Feed of TimeMaps
• A system that hosts Mementos exposes a Feed (e.g. Atom) of TimeMaps to allow applications to remain in sync with its evolving Memento collection:
• One Atom entry per Original Resource for which the system hosts Mementos;
• The entry provides a “timemap” link to a TimeMap for the Original Resource;
• The datetime value of the updated field of the entry changes when an additional Memento for the Original Resource becomes available (i.e. the TimeMap changes);
• The ID of the entry is a tag URI based on the URI of the Original Resource.
Will be proposed to IIPC
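A feed entry of the kind described above could look roughly like the output of this Python sketch. The slides do not specify how the tag URI is derived, so the construction used here (tag authority and date supplied by the archive, followed by host and path of the Original Resource) is an assumption, as are the function name and all example URIs.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)


def timemap_entry(original_uri, timemap_uri, updated, tag_authority, tag_date):
    """Build one Atom entry describing the TimeMap of one Original Resource."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    parts = urlsplit(original_uri)
    # Assumed tag URI layout: tag:<authority>,<date>:<host><path>
    tag_uri = f"tag:{tag_authority},{tag_date}:{parts.netloc}{parts.path}"
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = tag_uri
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = original_uri
    # `updated` changes whenever a new Memento is added to the TimeMap.
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", rel="timemap", href=timemap_uri)
    return entry


entry = timemap_entry(
    "http://example.org/page",
    "http://archive.example.net/timemap/http://example.org/page",
    "2011-04-04T12:00:00Z",
    "archive.example.net",
    "2011",
)
print(ET.tostring(entry, encoding="unicode"))
```

An application that polls the feed only needs to re-fetch the TimeMaps of entries whose updated value has changed since its last visit.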
38. Batch discovery of Mementos: robots.txt
• robots.txt file is used by Web servers to convey crawling policies;
• Add a directive to support discovery of Mementos known to the server:
• Pointer to a single Memento can suffice as the robot can crawl on from there
• Mementos allow for discovery of TimeMaps via HTTP links.
• e.g. jcdl.org hosts snapshot archives of prior JCDL conferences and adds the following to its robots.txt:
Memento: jcdl.org/archive/2002/index.html
Will be promoted via Internet Draft
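A robot that understands the proposed directive could pick up its starting points along the lines of this Python sketch. The directive is only a proposal in these slides, so the parsing rules below (case-insensitive name, `#` starting a comment) are assumptions, and the function name is hypothetical.

```python
def memento_seeds(robots_txt):
    """Collect the values of all `Memento:` lines in a robots.txt document."""
    seeds = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Name: value" at the first colon.
        line = line.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "memento" and value.strip():
            seeds.append(value.strip())
    return seeds


ROBOTS_TXT = """User-agent: *
Disallow: /private/
Memento: jcdl.org/archive/2002/index.html
"""

seeds = memento_seeds(ROBOTS_TXT)
print(seeds)
```

From each seed, the robot would fetch the Memento and follow its HTTP links to reach the corresponding TimeMap.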
39. Batch discovery of TimeGates: robots.txt
• robots.txt file is used by Web servers to convey crawling policies;
• Add a directive to support discovery of TimeGates known to the server:
• TimeGates can be on the server itself or on an external server
• Value for the directive is typically a regular expression
• e.g. example.org could point at TimeGates in its associated transactional archive ta.org via robots.txt:
TimeGate: ta.org/timegate/http://example.org/*
Will be promoted via Internet Draft
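One way a client might turn such a directive value into a concrete TimeGate URI is sketched below in Python. The interpretation is an assumption: the value is read as a TimeGate prefix followed by a pattern for Original Resource URIs, `*` is treated as a shell-style wildcard rather than a full regular expression, and a missing scheme on the prefix defaults to http. The function name is hypothetical.

```python
import re
from fnmatch import fnmatchcase


def timegate_for(directive_value, original_uri):
    """Return the TimeGate URI for `original_uri`, or None if it is not covered."""
    # Find where the embedded Original Resource pattern starts.
    match = re.search(r"https?://", directive_value[1:])
    if match is None:
        return None
    split_at = match.start() + 1
    prefix, pattern = directive_value[:split_at], directive_value[split_at:]
    if not fnmatchcase(original_uri, pattern):
        return None
    # Assumption: a prefix without a scheme is reachable over plain HTTP.
    if not re.match(r"https?://", prefix):
        prefix = "http://" + prefix
    return prefix + original_uri


value = "ta.org/timegate/http://example.org/*"
print(timegate_for(value, "http://example.org/news/index.html"))
print(timegate_for(value, "http://other.example.com/"))
```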
40. Discovery of Systems that Host Mementos: Registry/Feed
• Registry of collections of Mementos, e.g. of Web Archives, Transactional Archives, etc.
• Feed of registry records.
• A registry record details essential characteristics of a Memento collection.
• cf. VOiD collection description for Linked Data.
Will be researched
41. Overview of Memento Framework
Deployment Progress
Memento and Data
Memento and Discovery
Memento and Branding
Alternative Web Archiving Strategies
42. Memento can recreate pages using resources from different archives.
This poses a branding challenge for archives.
43. Current Branding Practice for Web Archives
Page and embedded resources from same Web Archive
Branding for page and embedded resources
44. Branding for Web Archives in Memento Mode
Page and embedded resources from various Web Archives
Figure labels: Page branding; No branding; No branding
Will be researched
45. Overview of Memento Framework
Deployment Progress
Memento and Data
Memento and Discovery
Memento and Branding
Alternative Web Archiving Strategies
46. Crawl-based Archives host distinct observations.
Transactional Archives never miss an update.
47. Crawl-Based Web Archives
Observations
For example: Heritrix crawler for Internet Archive
48. Crawl-Based Web Archives
• Collect discrete observations of resources, not their entire evolution.
• Can be rejected (robots.txt, by user-agent, by host IP).
• Can be deceived (cloaking, by geo-location, by user-agent).
• Coverage of a particular Web server depends on the crawl strategy.
49. Server-Side Transactional Web Archives
Change History
For example: TTApache, PageVault, Vignette Web Capture
50. Server-Side Transactional Web Archives
• Collect all representations served by the to-be-archived server.
• To-be-archived server needs to cooperate.
• Incentives: e.g. institutional memory, official record of Web presence.
• Archival coverage restricted to the to-be-archived server; does not include external servers (e.g. embedded resources).
• To-be-archived server can submit falsified information.
• Archival collection management: what to keep, what not (e.g. significant changes, deduplication, …).
51. Development of Transactional Web Archive Software
Capture:
• Apache connection filter module (mod_ta) captures URI, headers, body;
• Module POSTs in real-time to transactional archive’s Submit URI.
Submit:
• Java-Grizzly-Jersey submission interface application;
• Berkeley DB metadata store;
• FS store for body and headers.
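The capture step can be illustrated with a small Python analogue. This is not mod_ta: it swaps the Apache connection filter for a WSGI middleware, and the real-time POST to the archive's Submit URI is replaced by a `submit` callback so the sketch runs without a network. All names and the record layout are assumptions.

```python
from wsgiref.util import request_uri, setup_testing_defaults


def capture_middleware(app, submit):
    """Wrap a WSGI app so every response is also handed to `submit`."""

    def wrapped(environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            # Remember status and headers before passing them on unchanged.
            captured["status"] = status
            captured["headers"] = list(headers)
            return start_response(status, headers, exc_info)

        # Buffer the whole body so the archive sees exactly what the client gets.
        body = b"".join(app(environ, recording_start_response))
        submit({
            "uri": request_uri(environ),
            "status": captured.get("status"),
            "headers": captured.get("headers", []),
            "body": body,
        })
        return [body]

    return wrapped


# Demo: an in-memory list stands in for the archive's Submit URI.
records = []


def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


environ = {}
setup_testing_defaults(environ)
response = capture_middleware(demo_app, records.append)(
    environ, lambda status, headers, exc_info=None: None
)
print(records[0]["uri"], len(records[0]["body"]))
```

In a deployment following the slides, `submit` would send the record to the transactional archive as each response is served, which is what makes every served representation available for archiving.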
52. Development of Transactional Web Archive Software
Access:
• Transactional archive natively supports Memento;
• Immediate availability of archived content;
• Export of WARC, e.g. for long-term archiving in other environment.
Development timeline:
• Ongoing development (LANL) and testing (ODU);
• Submit/Access finalized; development focus on collection management.
• Expected release as open source, 3rd Quarter 2011.
53. Memento
http://mementoweb.org/
Herbert Van de Sompel
Robert Sanderson
Michael L. Nelson
Big Leaps Towards Seamless Navigation of
the Web of the Past