A presentation of the work I did with the Research Library Prototyping Team at Los Alamos National Laboratory, given to the local chapter of the Special Libraries Association in New Mexico.
Reference Rot
1. Reference Rot
Los Alamos National Laboratory
Research Library Prototyping Team
Presented by Shawn M. Jones
2. Citations are the building blocks
of scholarly communications
Citations
Provide
Support and Evidence
+
Experiment and Results
Argument
3. DOIs Identify Scholarly
Publications
• Almost all scholarly publications (papers, articles, etc.) have an
associated Digital Object Identifier (DOI) maintained by CrossRef
• DOIs are persistent
• If a publisher changes ownership or sells part of its catalog, the DOI remains
with the publication so that scholars can continue to find the paper into the
future
"ISO 26324:2012(en), Information and documentation — Digital object identifier system". ISO.
4. URIs Identify Web Resources
The World Wide Web consists of resources, such as pages or
applications.
Each web resource is identified by a Uniform Resource Identifier (URI).
Examples of web resources:
• Web pages
• Google Search
• Software Web Sites
Each resource may have one or more representations that vary by
dimensions such as language or document format.
Uniform Resource Locators (URLs) are a subset of URIs that require a
web location (a server with an application or directory structure).
Architecture of the World Wide Web, Volume One (15 December 2004) edited by Ian Jacobs, Norman Walsh. https://www.w3.org/TR/webarch/
5. Scholars use URIs in References to
Web Resources
• The web resources behind URIs have no guarantee of
persistence; they can disappear because:
• Their website is gone due to lack of funding
• An organization changes its website and does not provide redirects from
the old resources
• And more…
6. Why use URIs?
• Existing publications are not the only
supporting evidence in scholarly work
• URIs are invaluable to researchers; they
allow them to cite:
• Software Projects
• Datasets
• Affiliation Web Sites
• Funding
• Scholar Web Sites
• Blog Posts
• Technical Reports
• Evidence such as news stories or Tweets
• And more…
7. Consider The Publication of the Paper
and the Reader In the Future Following
One of Its References
The paper is published at some point, and its citations using URIs were good at that time.
Will they be good for a reader in the future?
8. Reference Rot Problem #1:
Link Rot
The reader follows a reference and it is gone
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers
from Reference Rot. PLOS ONE 9(12): e115253. DOI: 10.1371/journal.pone.0115253
This web-at-large resource is linked from the scholarly article Generalizing the OpenURL
Framework beyond References to Scholarly Works but it is now gone!
9. Reference Rot Problem #2:
Content Drift
The reader follows a reference and it is not the same
Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI References
Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
This web-at-large resource is linked from the scholarly article Searching for Quantum Gravity with
High Energy Atmospheric Neutrinos and AMANDA-II but it has changed since publication.
10. A Potential Solution:
Web Archives!
Web Archives make snapshots of web resources
so users can go back and look at a web page as it
was in the past.
There are many web archives, such as:
• Internet Archive
• Perma.cc
• Archive.is
• Icelandic Web Archive
• UK Web Archive
• Library of Congress
These snapshots are called mementos.
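As a rough illustration of how a memento is requested, the sketch below builds a Wayback Machine URL for a given date and the `Accept-Datetime` request header used by the Memento protocol (RFC 7089). The function names and the example URI are assumptions made for this illustration, and no network request is sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def wayback_url(uri: str, when: datetime) -> str:
    """Build an Internet Archive URL asking for the memento closest to `when`."""
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{stamp}/{uri}"


def accept_datetime_header(when: datetime) -> dict:
    """Build the Accept-Datetime header a Memento client sends to a TimeGate."""
    utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(utc, usegmt=True)}


# Hypothetical publication date and reference URI, for illustration only.
published = datetime(2012, 6, 1, tzinfo=timezone.utc)
print(wayback_url("http://example.org/", published))
print(accept_datetime_header(published))
```

The archive answers with the capture closest to the requested date, which may be earlier or later than the date asked for.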
11. Questions
Addressed By
Our Research
1. Is the use of URI references on the
rise?
2. To what extent does link rot exist in
scholarly URI references?
3. To what extent does content drift
exist in scholarly URI references?
4. What can we do about reference
rot? Can Web Archives help?
5. When are people using URIs when
they should be using DOIs?
6. What can we do to ensure people
use DOIs when they exist?
12. Dataset
• 1.8 million articles from arXiv, Elsevier, and PubMed Central from 1997 to 2012
• For content drift comparison, Mementos are taken from 18 web archives
• The data was processed by the University of Edinburgh and Los Alamos National
Laboratory
• From these articles we extracted 1.06 million URI references
14. The Number of URI References
Goes Up Each Publication Year
Articles and URI references per
publication year - arXiv corpus.
Articles and URI references per
publication year - Elsevier corpus.
Articles and URI references per
publication year - PMC corpus.
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers
from Reference Rot. PLOS ONE 9(12): e115253. DOI: 10.1371/journal.pone.0115253
15. To what extent does link rot
exist in scholarly URI
references?
16. Link Rot for References Gets Worse
As We Look At Older Publications
Panels: arXiv corpus, Elsevier corpus, PMC corpus
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers
from Reference Rot. PLOS ONE 9(12): e115253. DOI: 10.1371/journal.pone.0115253
If a URI reference no longer responds, then we have link rot.
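A minimal sketch of such a check follows. It assumes a simple rule that is not stated on the slide: a reference counts as rotten when no response arrives or when the final HTTP status is outside the 2xx range. The function names are hypothetical, and some servers reject HEAD requests, so a real crawler would need a GET fallback.

```python
import urllib.error
import urllib.request
from typing import Optional


def fetch_status(uri: str, timeout: float = 10.0) -> Optional[int]:
    """Return the final HTTP status for `uri`, or None if nothing answered."""
    request = urllib.request.Request(uri, method="HEAD")
    try:
        # urlopen follows redirects, so this is the status of the final hop.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, bad URI


def has_link_rot(status: Optional[int]) -> bool:
    """Assumed rule: anything other than a 2xx final status is link rot."""
    return status is None or not 200 <= status < 300


# Offline demonstration on example statuses; no request is sent here.
for status in (200, 404, None):
    print(status, has_link_rot(status))
```

In use, `has_link_rot(fetch_status(uri))` would be evaluated once per extracted URI reference.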
17. Fewer Publications Are Immune
to Reference Rot
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five Articles Suffers
from Reference Rot. PLOS ONE 9(12): e115253. DOI: 10.1371/journal.pone.0115253
Immune publications have no URI references
Healthy publications have no link rot and have
mementos within 14 days of publication for all of their
references
Infected publications have link rot or have no
mementos for all of their references
As noted before, more and more publications use URI
references
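The three categories above can be sketched as a small classifier. The `Reference` type and its field names are assumptions made for this illustration, and the sketch treats every publication that is neither immune nor healthy as infected, which is one reading of the definitions on the slide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Reference:
    rotten: bool  # True if the URI no longer responds
    days_to_nearest_memento: Optional[int]  # None if no memento exists


def classify(references: List[Reference], window_days: int = 14) -> str:
    """Label a publication as immune, healthy, or infected."""
    if not references:
        return "immune"  # no URI references at all
    healthy = all(
        not ref.rotten
        and ref.days_to_nearest_memento is not None
        and abs(ref.days_to_nearest_memento) <= window_days
        for ref in references
    )
    # Assumption: anything that is not immune or healthy is infected.
    return "healthy" if healthy else "infected"


print(classify([]))
print(classify([Reference(False, 3)]))
print(classify([Reference(False, 3), Reference(True, 1)]))
```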
18. To what extent does
content drift exist in
scholarly URI references?
19. Because of Web Archives, We Can
Study Content Drift
This Page Changed Much over 3 Months
This Page Hasn’t Changed in 19 Years
Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI
References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
20. The Frequency Of Memento Creation
Is Not the Same for All Resources
Archived Regularly
Archived Occasionally
Archived Once
Archived Never
21. Step 1: Find a memento of a
reference from the publication date
of the paper
If a memento from before the publication date and a memento from after it match according to four similarity
measures, we consider the two to be the same, and either is representative of the reference as it existed at the
time of publication.
Representative mementos get compared with the current live version of the same reference in step 2.
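The slide does not name the four similarity measures, so the sketch below uses two common set-based measures, Jaccard and Sørensen-Dice, as stand-ins to show the shape of the comparison. The whitespace tokenization, the function names, and the default threshold of 1.0 (an exact match on both measures) are assumptions, not the study's actual settings.

```python
def tokens(text: str) -> set:
    """Lowercase the text and split it into a set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sorensen_dice(a: set, b: set) -> float:
    """Twice the intersection size divided by the sum of both set sizes."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def same_content(before: str, after: str, threshold: float = 1.0) -> bool:
    """Treat two snapshots as the same only if every measure meets the threshold."""
    a, b = tokens(before), tokens(after)
    return jaccard(a, b) >= threshold and sorensen_dice(a, b) >= threshold


print(same_content("Hello world", "hello   world"))
print(same_content("old front page", "new front page"))
```

The same kind of check can be reused when a representative memento is compared with the live resource.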
Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI
References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
22. Many References Do Not Have
Representative Mementos
Panels: arXiv corpus, Elsevier corpus, PMC corpus
Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI
References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
23. Step 2: Compare the memento of the
reference with the current web
resource
Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI
References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
Using the same 4 similarity measures, we compare the content of the current resource with the content of the
representative memento.
24. Content Drift Is Worse For Older
Publications
Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI
References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
Panels: arXiv corpus, Elsevier corpus, PMC corpus
25. What can we do about
reference rot? Can Web
Archives help?
26. What can we do about reference
rot? Can Web Archives help?
1. Scholars can proactively create
mementos in web archives for URI
references
• The Internet Archive’s “Save Page Now”
• Perma.cc, Archive.is, and WebCite exist
for this purpose
• Mink, Webrecorder.io
2. Other scholars/editors can reference
these snapshots in scholarly literature
• Robust Links
• Memento
Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI
References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
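The two steps above can be sketched as follows: one helper builds the Internet Archive "Save Page Now" address for a URI, and another writes an HTML link decorated with the Robust Links attributes `data-versionurl` and `data-versiondate`. The function names and example URIs are hypothetical, and the sketch only builds strings; it does not contact any archive.

```python
from html import escape


def save_page_now_url(uri: str) -> str:
    """Address that asks the Internet Archive to capture `uri` when visited."""
    return f"https://web.archive.org/save/{uri}"


def robust_link(original_uri: str, memento_uri: str,
                version_date: str, text: str) -> str:
    """Build an anchor that points at the original and names its snapshot."""
    return (
        f'<a href="{escape(original_uri, quote=True)}" '
        f'data-versionurl="{escape(memento_uri, quote=True)}" '
        f'data-versiondate="{escape(version_date, quote=True)}">'
        f"{escape(text)}</a>"
    )


# Hypothetical reference and snapshot, for illustration only.
print(save_page_now_url("http://example.org/page"))
print(robust_link(
    "http://example.org/page",
    "https://web.archive.org/web/20120601000000/http://example.org/page",
    "2012-06-01",
    "Example page",
))
```

Keeping the original URI in `href` means the link still works for ordinary readers, while tools that understand the extra attributes can fall back to the snapshot.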
28. Are people using URIs in references
when they should be using DOIs?
Van de Sompel H, Klein M, and Jones SM. 2016. Persistent URIs Must Be Used To Be Persistent. In Proceedings of WWW
2016, pp. 119-120. DOI: 10.1145/2872518.2889352
Panels: arXiv corpus, PMC corpus
We hypothesize that this is caused by citation software using the URI instead
of the DOI because it does not know the DOI.
29. Problem: Machines Just See Links,
Where Is The DOI?
[Figure: a landing page annotated with "Link" and "Links" labels and a single "URI" label]
30. Humans Can Get Meaning from
Links on a Web Page
Authors
DOI
Bibliographic Metadata
PDF Document
31. Problem: Machines Cannot Find
the DOI
Browsers and citation software can easily
access the URI; it indicates how to retrieve the
resource.
The DOI is buried in the text of the landing
page.
Citation software must be programmed with
many publishers’ templates in order to find
the DOI across all resources. Publishers also
change their templates, causing software to
break.
Some publishers do not use the DOI in their
EndNote/BibTeX citations.
32. What can we do to
ensure people use DOIs
when they exist?
33. HTTP Already Has A Solution, We
Just Need to Use It
• HTTP is the protocol of the web
• Before HTTP sends content, it sends headers
• Inside these headers, publishers can use the Link header to reference other
content
• Because the metadata is stored in the transfer protocol:
• This solution requires no change to the content, meaning it works with any document
format.
• This solution can be applied to existing content with no change to the content.
HTTP/1.1 200 OK
Date: Mon, 17 Jul 2017 17:53:54 GMT
Server: Apache/2.2.3 (Red Hat)
Connection: close
Link: <http://doi.org/10.101010/99999999>; rel="identifier"
Content-Type: text/html; charset=UTF-8
Van de Sompel H and Nelson ML. (2015) Reminiscing About 15 Years of Interoperability Efforts. D-Lib 21: 11/12. DOI: 10.1045/november2015-vandesompel
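To show how a machine could read the DOI from the response headers above, here is a minimal Link header parser. It is a simplified sketch: the function names are hypothetical, and the regular expressions do not handle commas or semicolons inside quoted parameter values, which a complete parser would need to support.

```python
import re
from typing import Dict, List


def parse_link_header(value: str) -> List[Dict[str, str]]:
    """Split a Link header value into dicts holding a target and its parameters."""
    links = []
    # Each link is "<target>" followed by its ";name=value" parameters.
    for match in re.finditer(r"<([^>]*)>([^<]*)", value):
        target, raw_params = match.group(1), match.group(2)
        params = {}
        for name, val in re.findall(r';\s*([\w-]+)\s*=\s*"?([^";,]*)"?', raw_params):
            params[name.lower()] = val.strip()
        links.append({**params, "target": target})
    return links


def find_targets(value: str, rel: str) -> List[str]:
    """Return the targets of every link whose rel list contains `rel`."""
    return [
        link["target"]
        for link in parse_link_header(value)
        if rel in link.get("rel", "").split()
    ]


header = '<http://doi.org/10.101010/99999999>; rel="identifier"'
print(find_targets(header, "identifier"))
```

A citation manager could run this against the headers of a landing page or a PDF and use the returned URI instead of the page's own address.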
34. Using the HTTP Link Header, the
machine can find the DOI
Using the HTTP Link header, publishers can provide metadata
linking to the DOI from their resources.
This way, a browser or citation manager can find the DOI
whether it is on the landing page or the PDF page.
This effort is named
“Signposting the Scholarly Web”.
35. Signposting is not just for DOIs
• Why not link from
the document’s
landing page to the
author’s ORCID?
36. Signposting is not just for DOIs
• Why not link from
the document to the
metadata?
37. Signposting is not just for DOIs
• Why not link from
the landing page to
supplemental items
or other publication
formats?
40. Scholarly URI References In Jeopardy
• URIs identify web resources and
are not persistent
• Link rot and content drift are
problems for URI references and
get worse the older the
publication is
• Scholars sometimes use URIs
instead of DOIs when creating
references, even if DOIs exist
41. New Hope for Scholarly References
• Web Archives play a role in
preserving references
• We can use a variety of tools
to create mementos of
references at the time of
publication
• We can access them with
Memento and Robust Links
• We can use signposting to
help reference managers and
other tools find DOIs and
other information
42. Thanks for listening
Klein M, Van de Sompel H, Sanderson R, Shankar H, Balakireva L, et al. (2014) Scholarly Context Not Found: One in Five
Articles Suffers from Reference Rot. PLOS ONE 9(12): e115253. DOI: 10.1371/journal.pone.0115253
Jones SM, Van de Sompel H, Shankar H, Klein M, Tobin R, et al. (2016) Scholarly Context Adrift: Three out of Four URI
References Lead to Changed Content. PLOS ONE 11(12): e0167475. DOI: 10.1371/journal.pone.0167475
Van de Sompel H, Klein M, and Jones SM. 2016. Persistent URIs Must Be Used To Be Persistent. In Proceedings of WWW
2016, pp. 119-120. DOI: 10.1145/2872518.2889352
Van de Sompel H and Nelson ML. (2015) Reminiscing About 15 Years of Interoperability Efforts. D-Lib 21: 11/12. DOI:
10.1045/november2015-vandesompel
http://robustlinks.mementoweb.org
http://signposting.org
http://timetravel.mementoweb.org