Tales from the Keepers Registry: Dr Who and the Scholarly Record
1. Association of Subscription Agents & Intermediaries
ASA ANNUAL CONFERENCE 2014
24-25 February 2014
Tales from the Keepers Registry
Dr Who and the Scholarly Record
Peter Burnhill
EDINA, University of Edinburgh, UK
http://creativecommons.org/licenses/by/3.0/
2. Overview
1. “Who does forever?” : A Registry of Keepers
Who is looking after e-journals with archival intent?
2. Dr Who and the Scholarly Record
Time Travel for Scholarly Web
3. Evidence from the Keepers Registry
Statistics on who is looking after what, & what is at risk
3. Some Consequences of Web
• Essentials of supply chain have changed
– licensed to access, not sale of content
• Libraries no longer take physical custody of much key content
– online remotely, not on-shelf locally
• Role of libraries as trusted keepers of information and culture has been disrupted
– Need assurance of continuity of access
• of all content for future generations
• of the back copies, post-cancellation of the licence
• Does this mean that the Scholarly Record is at risk?
“The Library [Committee], which is made up of librarians and academics, … wants reassurance about long-term preservation before confirming a University policy of going e-only.”
from email sent by a big UK Library
4. 1. “Who does forever?”
Many reports over past 10 years highlighted risks
• ‘digital decay’: format obsolescence & bit rot
and warned against single points of failure:
• natural disasters (earthquake, fire and flood)
• human folly (criminal and political action): hacking
+ risks with commercial events in the publisher/supply chain
Some early archiving initiatives emerged …
• e-Depot at Koninklijke Bibliotheek
– international significance (Elsevier & Kluwer)
• the LOCKSS project at Stanford University
– from which came CLOCKSS [as library/publisher ‘dark archive’]
• the electronic-archiving initiative at JSTOR
– from which came Portico [as service provider]
5. A ‘global challenge’: trans-national action
%age of the 113,000 ISSN issued for e-serials, by country (map labels):
• US.LoC: 20%
• UK.BL: 10%
• Netherlands & Germany: c. 4.5% each
• Brazil: 4%
• ‘hidden’ e-journals: low % ISSN
Researchers (and therefore libraries) in any one country
are dependent upon content written and published
in countries other than their own
6. A Variety of ‘Archiving Organisations’
① web-scale not-for-profit archiving agencies
e.g. CLOCKSS Archive & Portico
② national libraries (with legal deposit in mind)
e.g. e-Depot (Netherlands); British Library; DnB etc
③ research libraries: consortia & specialist centres
e.g. Global LOCKSS Network, HathiTrust, Scholars Portal, Archaeology Data Service
Disclaimer:
University of Edinburgh is a CLOCKSS Node & Board Member:
Jisc supports UK LOCKSS Alliance
7. How can we know who is looking after what & how?
(and uncover what is still at risk)
The Keepers Registry, product of Jisc-funded PEPRS Project (EDINA & the ISSN IC)
[Diagram: E-Journal Preservation Registry (Taken from Figure 1 in reference paper in Serials, March 2009)]
• SERVICES: user requirements; E-J Preservation Registry Service
• Data dependency: E-Journal Preservation Registry
• METADATA on extant e-journals: ISSN Register
• METADATA on preservation action: Digital Preservation Agencies, e.g. CLOCKSS, Portico; BL, KB; UK LOCKSS Alliance etc.
• ISSN Register at heart of the Data Model; ISSN-L as kernel field
8. How can we know who is looking after what & how?
(and uncover what is still at risk)
[Same diagram as on the previous slide (Taken from Figure 1 in reference paper in Serials, March 2009), with one addition]
• ISSN Register at heart of the Data Model; ISSN-L as kernel field
• Look forward to ISNI for publisher as kernel field
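A minimal sketch of the data model described on these two slides, assuming only what the diagram states: metadata on extant e-journals and metadata on preservation action are brought together using ISSN-L as the kernel field. The class names, field names and sample records below are hypothetical illustrations, not the Registry's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterEntry:
    """Metadata on an extant e-journal (hypothetical shape)."""
    issn_l: str  # linking ISSN, used here as the kernel (join) field
    title: str


@dataclass(frozen=True)
class PreservationReport:
    """Metadata on preservation action reported by one keeper (hypothetical shape)."""
    issn_l: str
    keeper: str
    volumes: str  # free-text coverage statement


def keepers_by_title(register, reports):
    """Map each registered ISSN-L to the sorted list of keepers reporting it."""
    found = defaultdict(set)
    for report in reports:
        found[report.issn_l].add(report.keeper)
    return {entry.issn_l: sorted(found[entry.issn_l]) for entry in register}


# Made-up placeholder records, not real Registry data.
register = [
    RegisterEntry("0000-0001", "Journal A"),
    RegisterEntry("0000-0002", "Journal B"),
]
reports = [
    PreservationReport("0000-0001", "CLOCKSS", "v.1-10"),
    PreservationReport("0000-0001", "Portico", "v.3-10"),
]
status = keepers_by_title(register, reports)
print(status)
```

A title with an empty list is one that no keeper has reported, which is the sense of 'at risk' used later in the talk.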
9. Many archiving organisations is a Good Thing
“Digital information is best preserved by replicating it at
multiple archives run by autonomous organizations”
B. Cooper and H. Garcia-Molina (2002)
10. Now have a global Registry of e-journal archiving
… to discover who is looking after what
Enter title or ISSN to search across metadata reported by leading archiving organisations
*news* Library of Congress has now joined the Keepers Registry
[& have high hopes for some others …]
11. … and discover details of its ‘archival status’
Example search: ‘Origins of Life’
This e-journal is being archived by 5 archiving agencies …
… but coverage of volumes is partial & patchy
12. Overview: Time for Part 2
2. Dr Who and the Scholarly Record
(Time Travel for Scholarly Web)
• ‘Reference Rot’: When what was referenced and cited
ceases to say the same thing, or ‘has ceased to be’
http://www.snorgtees.com/this-parrot-has-ceased-to-be
13. The ‘reference rot’ problem definition
Investigating Reference Rot in Web-Based Scholarly Communication
1. http:// link to a resource no longer works
• Link rot
2. The citation is inadequate
• Not robust over time
3. The content referenced at the end of the link
a) has evolved,
b) changed dramatically,
c) disappeared completely.
http://hiberlink.org #hiberlink
14. Hiberlink Project: Andrew W. Mellon Foundation
Partners
• Los Alamos National Laboratory: Research Library
– Martin Klein, Robert Sanderson, Herbert Van de Sompel
• University of Edinburgh: EDINA & Language Technology Group
– Peter Burnhill, Neil Mayo, Muriel Mewissen, Christine Rees, Tim Stickland, Richard Wincewicz & Beatrice Alex, Claire Grover, Richard Tobin, Ke ‘Adam’ Zhou
Acknowledgments
• Primary datasets: arXiv, Chesapeake Project, Elsevier, PubMed Central, PLoS, … Planning on large-scale investigation (looking for more …)
• Secondary datasets: Ex Libris, MS Academic, SerialsSolutions
• Technology support: CrossRef Labs, CrossRef Prospect, Elsevier
• Liaisons: archive.is, CrossRef, Internet Archive, Old Dominion University Web Science & Digital Library Research Group, perma.cc
15. Hiberlink Project: Four work packages
1. Problem Quantification. Text mining of a vast corpus of scholarly literature to uncover references to web resources (URIs); using Memento to determine availability on the live web and in archives.
2. Archival Solution Infrastructure. Prototyping proactive, web-centric mechanisms for archiving cited web resources at the point of use or publication.
3. Temporal Reference Solutions. Prototyping new methods of citation to enable creation of precise & actionable time-specific references.
4. Dissemination and Outreach. Raising awareness of the challenges at the heart of digital scholarly communication.
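To make the first work package concrete, here is a rough sketch of pulling URIs out of article text. It is not the Hiberlink text-mining pipeline: the regular expression is deliberately simple, and the rule that treats DOI resolver links as scholarly and everything else as 'Web at Large' is an assumption made for illustration. The DOI in the sample text is a placeholder.

```python
import re

# Deliberately simple pattern: http(s) URIs up to whitespace or a closing bracket/quote.
URI_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

# Assumed heuristic: links through the DOI resolver point at scholarly works.
DOI_PATTERN = re.compile(r"https?://(dx\.)?doi\.org/")


def extract_uris(text):
    """Return the unique URIs found in text, in order of first appearance."""
    seen = []
    for match in URI_PATTERN.findall(text):
        uri = match.rstrip(".,;:")  # drop trailing sentence punctuation
        if uri not in seen:
            seen.append(uri)
    return seen


def is_web_at_large(uri):
    """True when the URI is not a DOI resolver link (crude, assumed split)."""
    return DOI_PATTERN.match(uri) is None


sample = (
    "See http://hiberlink.org, and also (http://thekeepers.org). "
    "DOI: http://dx.doi.org/10.1000/xyz."
)
uris = extract_uris(sample)
wider_web = [u for u in uris if is_web_at_large(u)]
print(uris)
print(wider_web)
```

Checking whether each extracted URI still resolves, and whether an archived copy exists, would be the next step and needs network access, so it is left out here.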
16. Investigating Reference Rot in Web-Based Scholarly Communication
References in Web-Based Scholarly Communication
• References to other online scholarly works
– Link Rot: DOI, HTTP version of DOI
– Content Decay: fixity of content; archiving: CLOCKSS, LOCKSS, Portico … (Keepers Registry)
– This is becoming understood, but there are issues; see David Rosenthal blog post http://blog.dshr.org/2013/11/patio-perspectives-at-anadp-ii.html
• References to online resources on the ‘wider Web’
– This is unexplored, so to be Hiberlink focus
17. Articles increasingly link to online resources on the ‘wider Web’
[Chart: URIs extracted from PubMed papers – links to Web at Large resources]
18. Quantifying the extent of „Reference Rot‟ – Early Results
Using: PubMed Central Corpus 01/1997 - 12/2012
• Articles processed: 494,785
• Articles that contain links (URIs) to ‘Web at Large’: 176,527
• Number of references to ‘Web at Large’ URIs: 557,432
• Unique referenced Web at Large URIs: 327,782
Percentage Exists & Archived Referenced URIs
• Exists & Archived: 31.2% (31% are available & safe)
• !Exists & Archived: 11.3% (11% can be retrieved)
• Exists & !Archived: 40.7% (41% at risk)
• !Exists & !Archived: 16.8% (17% are lost)
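The four categories on this slide come from two yes/no questions per URI: does it still exist on the live web, and is there an archived copy? A small sketch of that classification follows, using the slide's own labels. The function names and the sample observations are illustrative assumptions, not project code or project data.

```python
from collections import Counter

# (exists on live web, archived) -> label used on the slide
LABELS = {
    (True, True): "available & safe",
    (False, True): "can be retrieved",
    (True, False): "at risk",
    (False, False): "lost",
}


def classify(exists, archived):
    """Return the slide's label for one referenced URI."""
    return LABELS[(bool(exists), bool(archived))]


def summarise(observations):
    """Percentage of URIs per label, rounded to one decimal place."""
    counts = Counter(classify(e, a) for e, a in observations)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS.values()}
    return {label: round(100 * counts[label] / total, 1) for label in LABELS.values()}


# Made-up observations: (exists, archived) per URI.
observations = [(True, True), (False, True), (True, False), (True, False)]
summary = summarise(observations)
print(summary)
```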
19. Thoughts on How to Address Content Decay
Who is not selling defective goods?
• Remedy: Pro-active approach to trigger web
archiving when web-based content is referenced
in scholarly work:
– By authors
• during note taking, authoring, when submitting
– By publication platforms
• During submission, editing, acceptance, issue
20. + Tool with Temporal Context for Links
• Memento for Chrome is an application that uses Original URI-R and
dates to access Mementos in various web archives
Memento Time Travel for Chrome
http://bit.ly/memento-for-chrome
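The idea of pairing an original URI (URI-R) with a date can be sketched as follows. In the Memento protocol a client asks a TimeGate for the version of a resource nearest to a chosen moment by sending an Accept-Datetime header. The TimeGate address below is an assumption for illustration; any Memento-compliant TimeGate could be substituted. The sketch only builds the request and sends nothing over the network.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

# Assumed TimeGate endpoint, for illustration only.
TIMEGATE = "http://timetravel.mementoweb.org/timegate/"


def memento_request(uri_r, when):
    """Build the TimeGate URL and headers for one URI-R and one moment in time.

    `when` should be a timezone-aware datetime; it is converted to GMT
    because Accept-Datetime values are expressed in GMT.
    """
    when_utc = when.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(when_utc, usegmt=True)}
    return TIMEGATE + uri_r, headers


url, headers = memento_request(
    "http://hiberlink.org", datetime(2014, 2, 24, tzinfo=timezone.utc)
)
print(url)
print(headers)
```

An HTTP client would then issue a GET for `url` with `headers` and follow the redirect to the selected Memento.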
21. Back To The Overview - Part 3
3. Evidence from the Keepers Registry
Statistics on who is looking after what, & what is at risk
E-journals should be easy – right?
… but is the e-journals problem being solved?
22. 3. Evidence from the Keepers Registry
a) 21,557 e-serial titles are reported as being ingested by the 10 Keepers
– organisations with archival intent
– with many ‘missing volumes and issues’
b) 113,092 ISSN assigned to ‘online serials’ in the ISSN Register
Progress with a key indicator: ratio of a/b = 19%
– was 17% at close of 2011 (16,558 / 97,563)
Progress, but far from ‘job done’
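The key indicator is simple arithmetic on the figures quoted on this slide. A short check, where the function name is just an illustrative choice:

```python
def coverage_ratio(titles_ingested, issn_assigned):
    """Share of ISSN-assigned online serials reported as ingested, in percent."""
    return 100 * titles_ingested / issn_assigned


# Figures as quoted on the slide.
now = coverage_ratio(21_557, 113_092)
end_2011 = coverage_ratio(16_558, 97_563)
print(f"close of 2011: {end_2011:.1f}%  now: {now:.1f}%")
```

Both the numerator and the denominator grew between the two dates, which is why a sizeable rise in ingested titles moves the ratio by only about two percentage points.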
23. Do we need to agree a ‘priority list’ of titles?
1. Should we only be interested in the c.30,000 ‘peer-reviewed’ scholarly journals? [Ulrich’s]
2. Do we look only at what individual libraries list?
– In 2012 we checked ‘archival status’ for 3 large university libraries: c.75% ‘at risk’; c.11% held by 3 or more
• Two key indicators: %age (& number) of titles that are ‘at risk of loss’; %age (& number) of titles that are ‘preserved by 3 or more Keepers’.
3. Should we ask the audience?
• The researchers and students who read online serials
24. Looking from the user’s point of view …
… with usage logs for the UK OpenURL Router
• 10.4m full text requests in 2012; ISSN-L to de-duplicate ISSN
• 53,311 online titles requested by researchers & students from 108/160+
Analysis using the Keepers Registry:
• Only 15% (7,862) are being kept by 3+ Keepers
• Over two thirds (68%) held by none: 36,326 titles ‘at risk’ of loss
So ‘preservation’ (or lack of it) is still a real and present problem!
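The two key indicators can be reproduced from the counts on this slide. In the sketch below the banding function and its labels are illustrative assumptions. The 'kept by 1-2' count is not stated on the slide; it is derived here by subtraction.

```python
def risk_band(n_keepers):
    """Band a title by the number of keepers reporting it (assumed labels)."""
    if n_keepers == 0:
        return "at risk"
    if n_keepers >= 3:
        return "kept by 3+"
    return "kept by 1-2"


# Counts as quoted on the slide.
total_titles = 53_311
kept_by_three_plus = 7_862
held_by_none = 36_326

# Derived by subtraction, not stated on the slide.
kept_by_one_or_two = total_titles - kept_by_three_plus - held_by_none

shares = {
    "kept by 3+": 100 * kept_by_three_plus / total_titles,
    "kept by 1-2": 100 * kept_by_one_or_two / total_titles,
    "at risk": 100 * held_by_none / total_titles,
}
for band, share in shares.items():
    print(f"{band}: {share:.0f}%")
```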
25. Good News & Main Challenge?
Good news?
• Most of the big publishers engage with archiving initiatives
– typically CLOCKSS, e-Depot and Portico.
• Are those titles, volumes & issues actually being archived?
Main challenge?
• Long tail of smaller publishers - regardless of business model.
• Everyone in the audience should check whether they are participating in at least one preservation approach.
• Role of Agents, who arrange subscriptions with those
small publishers?
– Or only role of national libraries & research library consortia?
26. Choice of future with 2020 Vision
• Best Case scenario for ASA 2020
– Libraries, Agents & Publishers have acted to reduce that alarming 80% figure to near to zero
– They have ensured that all the e-journal content used by their researchers in 2013 has been preserved and can be successfully used in 2020, and assuredly beyond.
• Worst Case scenario for ASA 2020
– Libraries, Agents & Publishers have failed to act
– Important literature has been lost
– Citizens & scholars complain of neglect!
27. The Keepers Registry: Actionable Evidence
Sidebar note on monitoring their progress …
1. To assist publishers ‘do the right thing’
– A showcase for the real heroes: the archiving organisations
– Means to check what content is being reported as archived
– Provide libraries, publishers & archiving organisations with lists of titles that seem to be at risk of loss
2. To keep a close focus on volumes & issues
– Need for Publishers & Libraries to make sure all issued content is being kept safe
3. To assist collaboration between Keepers: ‘a safe places network’
4. If it is worth preserving, it really should have an identifier
Breaking News: New release (end of Q1 2014) Members Area:
– Upload a list of ISSNs & get back archival status of Titles
– Access to API, to report archival status on 3rd Party websites
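Before uploading a list of ISSNs it is worth checking that each one is well formed. An ISSN has eight characters, the last being a check digit computed modulo 11 with weights 8 down to 2 over the first seven digits; a value of 10 is written as X. The sketch below implements only that published rule. It says nothing about how the Members Area itself validates uploads, and the sample strings were chosen simply because they satisfy (or break) the rule.

```python
import re

ISSN_SHAPE = re.compile(r"[0-9]{4}-?[0-9]{3}[0-9Xx]")


def issn_check_digit(first_seven):
    """Compute the check character for the first seven digits of an ISSN."""
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def is_valid_issn(issn):
    """True when the string has ISSN shape and a correct check character."""
    if ISSN_SHAPE.fullmatch(issn) is None:
        return False
    compact = issn.replace("-", "").upper()
    return issn_check_digit(compact[:7]) == compact[7]


candidates = ["0378-5955", "2049-3630", "2434-561X", "0378-5954", "not-an-issn"]
checked = {issn: is_valid_issn(issn) for issn in candidates}
print(checked)
```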
28. Gentle Wake-up Call to Ensure Continuity of Access
‘Go Smell The Coffee’
#hiberlink
http://thekeepers.blogs.edina.ac.uk/
http://thekeepers.org
http://hiberlink.org/
29. Ask a librarian in 2020: 3 possible answers
1. "Yes, we have it (we've checked recently, both in the
catalogue and in actuality), and you can access it now"
2. "No, but we know somebody that does (whom we trust),
– so we can point you to (or arrange access to) it now/soon-ish"
3. "Sorry, we don't know …
- perhaps nobody has it
- it may be lost forever, altho' perhaps somebody somewhere ..."
- That was true for the print world
- Unfortunately, unless we do something now, the 3rd answer
could become the common one for a lot of e-journal content
30. Sidebar note on National Libraries
Should we wait upon Legal Deposit?
– 94% of libraries have some form of legal deposit for print.
• Only 44% of national libraries had legislation in 2011 for e-books or e-journals; expected to rise to 58% by June 2012.
• Only 27% [expected to rise to 37% by June 2012] actually ingesting via legal deposit
from presentation, CENL 2011 Survey by Lynne Brindley to CDNL Annual Meeting Puerto Rico, 15/8/11
Total national libraries collecting = those 14 via legal deposit + 9 by other means (Netherlands, UK/BL, Switzerland voluntary deposit)
Only KB e-Depot, BL, NSLC (+ LoC) in The Keepers Registry
Only when the other 19 join will we all know about their activity
Key point is not about call for ‘legal deposit’ but that on its own it is taking too much time
E-journals, and online serials more generally, are a big part of the scholarly record. If we use the distribution of assigned ISSN as a guide, we have some measure of just how international the problem space is. Centring the map on Singapore, Asia and the Pacific for a change: yes, a lot is published in the US, UK, Netherlands and Germany, but over 60% is not. And that is an underestimate, because so many online serials in countries in the centre of this map do not have an ISSN assigned; they remain hidden to our arithmetic.
by considering 3 answers that might be given to a user asking after a particular e-journal content – typically an article