Presentation for PIDapalooza 2019, Dublin, Ireland.
The Scholarly Orphans project, funded by the Andrew W. Mellon Foundation, explores technical approaches aimed at capturing and archiving scholarly artifacts that researchers deposit in web productivity portals as a means to collaborate and communicate with their peers. These artifacts are not collected by other frameworks aimed at archiving the scholarly record (e.g., LOCKSS, Portico, Institutional Repositories) and are only incidentally captured by web archives. The project explores an institution-driven approach inspired by web archiving. To demonstrate its ongoing thinking, the project has devised an experimental automated pipeline that continuously discovers, captures, and archives artifacts. These are created by actual researchers who, for the purpose of the experiment, were virtually enlisted in a fictitious research institution. A portal at myresearch.institute provides an overview of the artifacts that were discovered and provides access to archived versions stored in both an institutional and a cross-institutional archive. The set-up leverages a range of technologies that share a flavor of persistence: Memento, Memento Tracer, Robust Links, Signposting.
To the Rescue of Scholarly Orphans
1. @hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Herbert Van de Sompel
DANS
@hvdsomp
https://orcid.org/0000-0002-0715-6126
To the Rescue of Scholarly Orphans
The Scholarly Orphans project
is funded by the Andrew W. Mellon Foundation
Scholarly Orphans Team
• Los Alamos National Laboratory:
• Lyudmila Balakireva
• Martin Klein
• James Powell
• Harihar Shankar
• Herbert Van de Sompel
• Old Dominion University:
• Sawood Alam
• Grant Atkins
• Shawn Jones
• Mat Kelly
• Michael L. Nelson
• Consideration
• Researchers deposit artifacts in web platforms
• Web Platforms:
• Dedicated to scholarship:
• Commercial: e.g., FigShare, Publons
• Not for profit: e.g., OSF, Zenodo
• General purpose:
• Commercial: e.g., GitHub, SlideShare
• Not for profit: e.g., Wikipedia, Wikidata
Research and Research Communication on the Web
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo - The researchers’ institutions are in the dark
• Do not know about the existence of these artifacts
• Do not have a copy of these artifacts
Research and Research Communication on the Web
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo – Uncertainty regarding long-term access
• Commercial: changing business model, no preservation commitment
• Not for profit: unpredictable funding stream
Research and Research Communication on the Web
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo - Not systematically archived
• No frameworks like LOCKSS/Portico exist for these artifacts
• Researchers only selectively deposit artifacts in portals that
provide archival guarantees, e.g., to obtain a citable DOI
• Can’t expect researchers to (also) upload all artifacts to IRs
• Web archives only incidentally archive these artifacts, cf.
Hiberlink research
Research and Research Communication on the Web
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
Scholarly Orphans – Project Overview
How to capture Scholarly Orphans for long-term archiving?
The Scholarly Orphans Project
• Explores an institution-driven paradigm
• Academic institutions typically have a long shelf life
• A basic premise underlying e.g., LOCKSS, perma.cc
• An academic institution should be interested in capturing the
artifacts (intellectual property) its scholars deposit on the web
• Collecting and archiving such artifacts aligns with the
mission of academic libraries
The Scholarly Orphans Project
• Explores a paradigm inspired by web archiving
• Scale of the problem
• Can’t expect researchers to upload all artifacts to an institutional
repository
• Bilateral agreements for archival purposes with most web
portals unlikely
Tracking Artifacts - Description
• In order to track artifacts that were recently deposited by an
institutional researcher in a portal, one reasonably needs:
• The web identity of the researcher in the portal
• Algorithmic discovery
• Discovery via a registry
• Manual collection
• A portal API that supports:
• Access by web identity
• Access to contributions “since …” for the web identity
• Result of tracking:
• URI(s) of new artifact(s) discovered in the portal
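The tracking step above can be sketched as a polling function: given a researcher's web identity on a portal and a "since" date, return the URIs of newly deposited artifacts. The portal name, field names, and response shape below are hypothetical stand-ins for whatever a concrete portal API actually returns:

```python
from datetime import datetime, timezone

def track_new_artifacts(portal_items, web_identity, since):
    """Return URIs of artifacts the given web identity deposited after `since`.

    `portal_items` stands in for the JSON a portal API would return for an
    "items by user" query; each item carries an owner, a creation date, and a URI.
    """
    new_uris = []
    for item in portal_items:
        created = datetime.fromisoformat(item["created"])
        if item["owner"] == web_identity and created > since:
            new_uris.append(item["uri"])
    return new_uris

# Example: one prior deposit and one new deposit by the tracked identity.
items = [
    {"owner": "jdoe", "created": "2018-07-15T10:00:00+00:00",
     "uri": "https://portal.example/artifact/1"},
    {"owner": "jdoe", "created": "2018-08-27T09:30:00+00:00",
     "uri": "https://portal.example/artifact/2"},
]
since = datetime(2018, 8, 1, tzinfo=timezone.utc)
print(track_new_artifacts(items, "jdoe", since))
# ['https://portal.example/artifact/2']
```

When a portal supports "contributions since" queries natively, the date filter moves into the API request and the client only has to collect the returned URIs.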
Tracking Artifacts - Challenges
• Portal API access by web identity
• Broadly supported by general purpose portals
• Typically not supported by scholarly portals
• Some lack an API altogether
• Should add ORCID access to APIs
• OAI-PMH and ResourceSync need sets per web identity
Capturing Artifacts - Description
• The capture process takes as input the URI of a new artifact
discovered in a portal
• Its task is to create a representative institutional capture of the
artifact
• Result of capture:
• WARC file for new artifact in an institutional archive
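The capture result is a WARC file. As a rough illustration of what one record in such a file looks like (a real pipeline would use a library such as warcio, which also generates record IDs and payload digests), a minimal, simplified response record can be serialized by hand:

```python
from datetime import datetime, timezone

def warc_response_record(uri, payload):
    """Serialize one simplified WARC 1.0 response record.

    `payload` is the raw HTTP response bytes captured for `uri`.
    Simplified: no WARC-Record-ID, no digest headers.
    """
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"WARC-Date: {now}\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "\r\n"
    )
    # A WARC record ends with two CRLF sequences after the payload.
    return headers.encode("utf-8") + payload + b"\r\n\r\n"

record = warc_response_record("https://portal.example/artifact/2",
                              b"HTTP/1.1 200 OK\r\n\r\n<html>...</html>")
print(record.decode().splitlines()[0])  # WARC/1.0
```

Concatenating one such record per captured resource (plus a warcinfo record) yields the WARC file that the institutional archive stores.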
Capturing Artifacts - Challenges
• Delineate the web boundary of the artifact
• More than the input artifact URI
• The boundary is in the eye of the beholder
• Create a high-fidelity capture using an approach that scales for a
steady stream of new artifacts
• Unsolved problem
• We made a significant breakthrough with the Memento Tracer
framework
Memento Tracer: http://tracer.mementoweb.org
Archiving Artifacts - Description
• The archiving process takes as input the URI of a WARC file
generated by the capture process
• Its task is to ingest the WARC file in a cross-institutional web archive
• This can be achieved using off-the-shelf web archiving software,
e.g., pywb, Open Wayback
• Result of archiving:
• Mementos pertaining to newly discovered artifact in a cross-
institutional, Memento-compliant web archive
• Possibility to link to artifacts using Robust Links:
<a href="URI-A"
   data-versionurl="URI-M"
   data-versiondate="date-of-capture">link text</a>
Robust Links: http://robustlinks.mementoweb.org/about/
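Given URI-A, URI-M, and the capture date, assembling the Robust Link markup is mechanical; a small helper (the URIs and link text below are illustrative) might look like:

```python
from html import escape

def robust_link(uri_a, uri_m, version_date, text):
    """Render a Robust Link: a plain anchor to the original URI, with the
    memento URI and capture date carried as data- attributes."""
    return (f'<a href="{escape(uri_a, quote=True)}" '
            f'data-versionurl="{escape(uri_m, quote=True)}" '
            f'data-versiondate="{version_date}">{escape(text)}</a>')

link = robust_link("https://portal.example/artifact/2",
                   "https://archive.example/memento/20180827/artifact/2",
                   "2018-08-27", "my artifact")
print(link)
```

Because the original URI stays in `href`, the link keeps working normally; the data- attributes only come into play when a reader (or a script) needs the archived version.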
Archiving Artifacts - Challenges
• Attempted to use ipwb, a pywb version that uses IPFS
• Cross-institutional distributed file system with redundancy
• Ran out of time to get it operationally stable
Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive
https://doi.org/10.1145/2910896.2925467
myresearch.institute - Researchers
• Uniquely identified by ORCIDs
• Web identities in multiple portals
• Create various types of artifacts
myresearch.institute - Portals
• Tracking started August 27, 2018
• Tracking artifacts created starting August 1, 2018
• 9,000+ artifacts tracked to date for all 16 researchers
Summary (1/2)
• The Scholarly Orphans project explores an institution-driven
approach to capture scholarly artifacts deposited in web portals
• Artifacts out of scope of existing archival approaches such as
LOCKSS, Portico, web archives
• Institutions have a long shelf life, should be interested in
collecting these artifacts, and have feasible scale for
identity/artifact discovery
Summary (2/2)
• Components of the experimental pipeline:
• Tracker: Automatically discover artifacts because researchers
will not upload them to the institution
• Capturer: High fidelity artifact captures through crowd-sourcing
navigation patterns with Memento Tracer
• Archiver: Cross-institutional, Memento-compliant scholarly web
archive
Herbert Van de Sompel
DANS
@hvdsomp
https://orcid.org/0000-0002-0715-6126
To the Rescue of Scholarly Orphans
The Scholarly Orphans project
is funded by the Andrew W. Mellon Foundation
No Long-Term Access Guarantees
“We reserve the right at any time and from time to time to modify or
discontinue, temporarily or permanently, the Website (or any part of
it) with or without notice.”
GitHub Terms of Service
https://help.github.com/articles/github-terms-of-service/
Funding Not for Profits
FAQ 19: How is arXiv funded and governed?
Answer: It is currently supported by funds from a network of
member libraries, the Simons Foundation, and financial support,
labor, and infrastructure provided by the Cornell University Library.
The annual membership fees depend on the institutional usage
and range from $1,500 to just $3,000 per year, comparable to the
author fees for a single open access article.
… nearly 200 libraries in 24 countries, who have made
contributions to support arXiv via the membership program ….
Paul Ginsparg (2017) Preprint Déjà Vu: an FAQ
https://arxiv.org/abs/1706.04188
Hiberlink Evidence
Web resources referenced in Elsevier corpus (1996-2012)
without representative Memento in public web archives
Inspiration
• LOCKSS
• Web crawling approach
• Focused on journal literature
• Archive-It
• On-demand, subscription-based web archiving
• Not focused on scholarly orphans
• Institutional repository, auto-discovery of journal articles
• Capture an institution’s output
• Focused on journal literature
• The Locker Project & Amy Guy’s Personal Web Observatory work
• Capture an individual’s web presence
• Not focused on scholarly orphans
http://rhiaro.co.uk/
https://rhiaro.github.io/thesis/
Tracking Artifacts - Implementation
• Tracker event notifications:
• Linked Data Notifications (JSON-LD) using AS2, PROV-O,
schema.org
• Identifiers: Unique tracker event identifier per notification
• Dates: artifact publication date & artifact tracked date
• URIs: 1+ artifact URI
• Event database:
• Notifications stored/indexed in ElasticSearch
• Researcher database:
• SQLite
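A tracker event notification along these lines could be assembled as a JSON-LD dictionary; the property names below are an illustrative AS2-flavored subset, not the project's exact payload:

```python
import json
import uuid

def tracker_notification(artifact_uris, orcid, published, tracked):
    """Build a Linked Data Notification body for a tracker event, using the
    AS2 context. Carries the crucial pieces: a unique event ID, the two
    dates (published and tracked), and one or more artifact URIs."""
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "id": f"urn:uuid:{uuid.uuid4()}",   # unique tracker event identifier
        "type": "Create",
        "actor": orcid,                      # researcher's ORCID
        "published": published,              # artifact publication date
        "updated": tracked,                  # artifact tracked date
        "object": [{"id": u} for u in artifact_uris],
    }

note = tracker_notification(["https://portal.example/artifact/2"],
                            "https://orcid.org/0000-0002-0715-6126",
                            "2018-08-27T09:30:00Z", "2018-08-27T12:00:00Z")
print(json.dumps(note, indent=2))
```

Each such notification would then be stored and indexed in the event database (Elasticsearch in the pipeline described above).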
Capturing Artifacts - Challenges
• Memento Tracer:
• Language used to express Traces (interoperability)
• Organization of the shared repository for Traces
• Limitations of the browser event listener approach for recording
Traces
• Selection of a Trace for capturing a web publication by other
means than URI pattern
• Legal constraints
Platforms where we could expect artifacts to be – they are not
Selective deposit: Their motivation is typically visibility/cite-ability (DOI), not archiving
Researchers don’t do it for the longevity of artifacts
Exploring what this archival layer could be
Technical aspects, who runs it, etc.
University of Porto: 100+ years old
University of Coimbra: 700+ years old
ETDs as an example to support institution’s and library’s mission
Institution in contrast to platform argument
Not going for commercial, non-profit platform – institutional solution b/c shelf life
Transition to technical part
Effort put into this: 6 months, 6 people
Pipeline is top 3
Track stuff researchers put in portals on the web
Once know, capture it, put in institutional archive
Contribute them to cross-institutional archive
All coordinated by central process
Whenever one component is done, info stored in event DB
DB knows about all artifact metadata
Left: An institution, each institution runs this or has someone run it on their behalf
Right: global, one organization or a consortium
Even inst. can be powered by cross-institutional tools
Tools shared beyond institution
Orchestrator sends message to tracker on regular basis
“For this user, track this portal”
Tracker sends back message with info about new artifact
Incremental polling of artifacts by users
We don’t have this info, ORCID would be logical to have it – gap in infrastructure
If researchers would use FOAF info, could use it from there
Manual process in institution to collect
Content drift: risk of missing info when the artifact has changed since discovery
Pushed by portals, subscribe
API restrictions
Mention Frank Shipman study
We already have an institutional archive, WARC archive
Also create a cross-institutional web archive
Currently central WA component
Storage could be decentralized via IPFS
Fill out N
- As opposed to global central portal thinking
Bring down the scale by institutional perspective
4k researchers manageable, 5M not so much; need to be Google
~100k articles with links
> 230k links total
Similarities inspire us
Different from what we are doing
IR auto-discovery is similar, connection point
More about locker and AG
Discover and capture personal contributions in social platforms
Focus on individual
Pictures in flickr
All events are sent as LDN, all use AS2, prov, schema as vocabulary
Crucial pieces of info in notification: IDs, dates, URIs
Specific for tracker: this ID, this date, these URIs
- Researcher DB contains ORCID of researchers as primary key and corresponding web portal IDs
- Tracker DB contains portal info
New paradigm for web archiving, found as part of this problem
Unexpected, yet most important result/contribution of this effort
Let’s imagine you need to frequently archive slide decks from SlideShare (we do)
Understand that there are boundary and quality problems
Bring human (curator) in the loop
Navigate to *one* SS presentation
Interact with that presentation in an attempt to show what the boundary is, make explicit what needs to be archived
Browser extension, listens to browser events, intercepts them and records them in abstract way (not in terms of URLs, addresses in the DOM, Xpath, CSS selectors)
Result: trace expresses in abstract way the interactions the curator had with slide deck
Abstract b/c same info how to interact with *this* presentation will apply to *all* presentations
Record one, share, re-use with headless browser
Share in repo, collectively create, curate traces, update with layout of pages
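The abstract-trace idea in these notes can be sketched as data plus a tiny replay loop. The trace structure, role names, and driver interface below are invented for illustration and are not the actual Memento Tracer serialization:

```python
# An abstract trace: interactions recorded against stable UI roles, not
# concrete URLs, DOM addresses, XPaths, or CSS selectors, so one trace
# recorded on a single SlideShare deck can be replayed against any deck.
trace = {
    "portal": "slideshare",
    "steps": [
        {"action": "click", "target": "next-slide-button", "repeat": "until-exhausted"},
        {"action": "click", "target": "download-link", "repeat": 1},
    ],
}

def replay(trace, driver):
    """Replay a trace with a headless-browser driver exposing click(role).

    Returns the list of interactions performed, in order.
    """
    performed = []
    for step in trace["steps"]:
        if step["repeat"] == "until-exhausted":
            # Keep clicking until the target no longer responds.
            while driver.click(step["target"]):
                performed.append(step["target"])
        else:
            for _ in range(step["repeat"]):
                driver.click(step["target"])
                performed.append(step["target"])
    return performed

class FakeDriver:
    """Stand-in driver: the next-slide button works 3 times, then is exhausted."""
    def __init__(self):
        self.slides_left = 3
    def click(self, target):
        if target == "next-slide-button":
            if self.slides_left == 0:
                return False
            self.slides_left -= 1
        return True

print(replay(trace, FakeDriver()))
# ['next-slide-button', 'next-slide-button', 'next-slide-button', 'download-link']
```

In the real framework the driver would be a headless browser capturing every resulting HTTP response into a WARC; the point of the sketch is only that the trace itself contains no deck-specific details.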
Canvas, google maps
Fetch WARC
Archive can receive messages from multiple institutions
archiver event message back to institutional orchestrator