SlideShare a Scribd company logo
1 of 46
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Herbert Van de Sompel
DANS
@hvdsomp
https://orcid.org/0000-0002-0715-6126
To the Rescue of Scholarly Orphans
The Scholarly Orphans project
is funded by the Andrew W. Mellon Foundation
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Scholarly Orphans Team
• Los Alamos National Laboratory:
• Lyudmila Balakireva
• Martin Klein
• James Powell
• Harihar Shankar
• Herbert Van de Sompel
• Old Dominion University:
• Sawood Alam
• Grant Atkins
• Shawn Jones
• Mat Kelly
• Michael L. Nelson
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Scholarly Orphans – Project Motivation
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
• Consideration
• Researchers deposit artifacts in web platforms
• Web Platforms:
• Dedicated to scholarship:
• Commercial: e.g., FigShare, Publons
• Not for profit: e.g., OSF, Zenodo
• General purpose:
• Commercial: e.g., GitHub, SlideShare
• Not for profit: e.g., Wikipedia, Wikidata
Research and Research Communication on the Web
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo - The researchers’ institutions are in the dark
• Do not know about the existence of these artifact
• Do not have a copy of these artifacts
Research and Research Communication on the Web
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo – Uncertainty regarding long-term access
• Commercial: changing business model, no preservation commitment
• Not for profit: unpredictable funding stream
Research and Research Communication on the Web
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
• Consideration
• Researchers deposit artifacts in web platforms
• Status quo - Not systematically archived
• No frameworks like LOCKSS/Portico exist for these artifacts
• Researchers only selectively deposit artifacts in portals that
provide archival guarantees; to obtain a cite-able DOI
• Can’t expect researchers to (also) upload all artifacts in IRs
• Web archives only incidentally archive these artifacts, cf.
Hiberlink research
Research and Research Communication on the Web
Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE
https://doi.org/10.1371/journal.pone.0115253
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Scholarly Orphans – Project Overview
How to capture Scholarly Orphans for long-term archiving?
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
The Scholarly Orphans Project
• Explores an institution-driven paradigm
• Academic institutions typically have a long shelf life
• A basic premise underlying e.g., LOCKSS, perma.cc
• An academic institution should be interested in capturing the
artifacts (intellectual property) its scholars deposit on the web
• Collecting and archiving such artifacts aligns with the
mission of academic libraries
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
An Institutional Perspective
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
The Scholarly Orphans Project
• Explores a paradigm inspired by web archiving
• Scale of the problem
• Can’t expect researchers to upload all artifacts in an institutional
repository
• Bilateral agreements for archival purposes with most web
portals unlikely
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
A Web Archiving Perspective
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Scholarly Orphans – Prototype Pipeline Overview
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Prototype Pipeline
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Prototype Pipeline
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Tracking Artifacts
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Tracking Artifacts - Description
• In order to track artifacts that were recently deposited by an
institutional researcher in a portal, one reasonably needs:
• The web identity of the researcher in the portal
• Algorithmic discovery
• Discovery via a registry
• Manual collection
• A portal API that supports:
• Access by web identity
• Access to contributions “since …” for the web identity
• Result of tracking:
• URI(s) of new artifact(s) discovered in the portal
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Tracking Artifacts - Challenges
• Portal API access by web identity
• Broadly supported by general purpose portals
• Typically not supported by scholarly portals
• Some lack an API altogether
• Should add ORCID access to APIs
• OAI-PMH and ResourceSync need sets per web identity
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Capturing Artifacts
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Capturing Artifacts - Description
• The capture process takes as input the URI of a new artifact
discovered in a portal
• Its task is to create a representative institutional capture of the
artifact
• Result of capture:
• WARC file for new artifact in an institutional archive
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Capturing Artifacts - Challenges
• Delineate the web boundary of the artifact
• More than the input artifact URI
• The boundary is in the eye of the beholder
• Create a high-fidelity capture using an approach that scales for a
steady stream of new artifacts
• Unsolved problem
• We made a significant breakthrough with the Memento Tracer
framework
Memento Tracer: http://tracer.mementoweb.org
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Capturing Artifacts
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Archiving Artifacts
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Archiving Artifacts - Description
• The archiving process takes as input the URI of a WARC file
generated by the capture process
• Its task is to ingest the WARC file in a cross-institutional web archive
• This can be achieved using off-the-shelf web archiving software,
e.g., pywb, Open Wayback
• Result of archiving:
• Mementos pertaining to newly discovered artifact in a cross-
institutional, Memento-compliant web archive
• Possibility to link to artifacts using Robust Links:
<a href=“URI-A”
data-versionurl=“URI-M”
data-versiondate=“date-of-capture”
Robust Links: http://robustlinks.mementoweb.org/about/
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Archiving Artifacts - Challenges
• Attempted to use ipwb, a pywb version that uses IPFS
• Cross-institutional distributed file system with redundancy
• Ran out of time to get it operationally stable
Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive
https://doi.org/10.1145/2910896.2925467
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Demo - myresearch.institute
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
myresearch.institute - Researchers
• Uniquely identified by ORCIDs
• Web identities in multiple portals
• Create various types of artifacts
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
myresearch.institute - Portals
• Tracking started August 27 2018
• Tracking artifacts created starting
August 1 2018
• 9000+ artifacts tracked to date
for all 16 researchers
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
myresearch.institute - Artifacts
• schema.org typology:
• Answer
• Article
• BlogPosting
• Comment
• Dataset
• PresentationDigitalDocument
• Question
• Review
• SoftwareSourceCode
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Pipeline Demo
https://myresearchinstitute.org
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Scholarly Orphans – Summary
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Summary (1/2)
• The Scholarly Orphans project explores an institution-driven
approach to capture scholarly artifacts deposited in web portals
• Artifacts out of scope of existing archival approaches such as
LOCKSS, Portico, web archives
• Institutions have a long shelf life, should be interested in
collecting these artifacts, and have feasible scale for
identity/artifact discovery
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Summary (2/2)
• Components of the experimental pipeline:
• Tracker: Automatically discover artifacts because researchers
will not upload them to the institution
• Capturer: High fidelity artifact captures through crowd-sourcing
navigation patterns with Memento Tracer
• Archiver: Cross-institutional, Memento-compliant scholarly web
archive
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Herbert Van de Sompel
DANS
@hvdsomp
https://orcid.org/0000-0002-0715-6126
To the Rescue of Scholarly Orphans
The Scholarly Orphans project
is funded by the Andrew W. Mellon Foundation
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
No Long-Term Access Guarantees
“We reserve the right at any time and from time to time to modify or
discontinue, temporarily or permanently, the Website (or any part of
it) with or without notice.
GitHub Terms of Service
http://help.github.com/articles/github-terms-of-service
https://help.github.com/articles/github-terms-of-service/
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Funding Not for Profits
FAQ 19: How is arXiv funded and governed?
Answer: It is currently supported by funds from a network of
member libraries, the Simon’s foundation and financial support,
labor, and infrastructure provided by the Cornell University Library.
The annual membership fees depend on the institutional usage
and range from $1,500 to just $3,000 per year, comparable to the
author fees for a single open access article.
… nearly 200 libraries in 24 countries, who have made
contributions to support arXiv via the membership program ….
GitHub Terms of Service
http://help.github.com/articles/github-terms-of-service
Paul Ginsparg (2017) Preprint Déjà Vu: an FAQ
https://arxiv.org/abs/1706.04188
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Hiberlink Evidence
Web resources referenced in Elsevier corpus (1996-2012)
without representative Memento in public web archives
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Inspiration
• LOCKSS
• Web crawling approach
• Focused on journal literature
• Archive-It
• On-demand, subscription-based web archiving
• Not focused on scholarly orphans
• Institutional repository, auto-discovery of journal articles
• Capture an institution’s output
• Focused on journal literature
• The Locker Project & Amy Guy’s Personal Web Observatory work
• Capture an individual’s web presence
• Not focused on scholarly orphans
http://rhiaro.co.uk/
https://rhiaro.github.io/thesis/
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Tracking Artifacts - Implementation
• Tracker event notifications:
• Linked Data Notifications (JSON-LD) using AS2, PROV-O,
schema.org
• Identifiers: Unique tracker event identifier per notification
• Dates: artifact publication date & artifact tracked date
• URIs: 1+ artifact URI
• Event database:
• Notifications stored/indexed in ElasticSearch
• Researcher database:
• SQLite
Tracking Artifacts - Architecture
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Capturing Artifacts - Implementation
• Capture event notifications:
• Identifiers: Unique capture event identifier per notification ;
Preceding tracker event identifier conveyed as provenance
• Dates: Datetime of WARC file creation
• URIs: 1+ WARC file URI
• Tracer, client-side:
• Tracer Chrome extension leveraging Selenium IDE
• Tracer, server-side:
• Stormcrawler ; Selenium (Chrome) with Tracer plug-in ;
WarcProxy ; file-system storage for WARC files
http://stormcrawler.net/
https://www.seleniumhq.org/projects/webdriver/
https://github.com/odie5533/WarcProxy
Capturing Artifacts - Architecture
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Archiving Artifacts - Implementation
• Archiver event notifications:
• Identifiers: Unique archiver event identifier per notification ;
preceding tracker/capturer event identifiers conveyed as
provenance
• Dates: WARC file ingest date ; Memento-Datetime values
URIs: 1+ Memento URI, each corresponding to an artifact URI
• Web Archive:
• pywb
• Social card:
• MementoEmbed
https://github.com/webrecorder/pywb
https://github.com/oduwsdl/MementoEmbed
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Memento Tracer - Framework
http://tracer.mementoweb.org
@hvdsomp
PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019
Capturing Artifacts - Challenges
• Memento Tracer:
• Language used to express Traces (interoperability)
• Organization of the shared repository for Traces
• Limitations of the browser event listener approach for recording
Traces
• Selection of a Trace for capturing a web publication by other
means than URI pattern
• Legal constraints
Archiving Artifacts - Architecture

More Related Content

What's hot

IFLA ARL Webinar Series: Digital Preservation - Managing Publications and Dat...
IFLA ARL Webinar Series: Digital Preservation - Managing Publications and Dat...IFLA ARL Webinar Series: Digital Preservation - Managing Publications and Dat...
IFLA ARL Webinar Series: Digital Preservation - Managing Publications and Dat...IFLAAcademicandResea
 
Web archiving collaborations: a presentation for colleagues working in the Li...
Web archiving collaborations: a presentation for colleagues working in the Li...Web archiving collaborations: a presentation for colleagues working in the Li...
Web archiving collaborations: a presentation for colleagues working in the Li...Anna Perricci
 
Thinking about resource issues: copyright and open access
Thinking about resource issues: copyright and open accessThinking about resource issues: copyright and open access
Thinking about resource issues: copyright and open accessAllison Fullard
 
Collaborative Web Archiving with Ivy Plus / Borrow Direct
Collaborative Web Archiving with Ivy Plus / Borrow Direct Collaborative Web Archiving with Ivy Plus / Borrow Direct
Collaborative Web Archiving with Ivy Plus / Borrow Direct Anna Perricci
 
CDN Getting Best Value from College Licenses: Jisc MediaHub
CDN Getting  Best Value from College Licenses: Jisc MediaHubCDN Getting  Best Value from College Licenses: Jisc MediaHub
CDN Getting Best Value from College Licenses: Jisc MediaHubEDINA, University of Edinburgh
 
1818 societypresentation revised2013
1818 societypresentation revised20131818 societypresentation revised2013
1818 societypresentation revised2013Eliza McLeod
 
Online Educational Resources for Researcher
Online Educational Resources for ResearcherOnline Educational Resources for Researcher
Online Educational Resources for ResearcherBhupendra Bansod
 

What's hot (8)

IFLA ARL Webinar Series: Digital Preservation - Managing Publications and Dat...
IFLA ARL Webinar Series: Digital Preservation - Managing Publications and Dat...IFLA ARL Webinar Series: Digital Preservation - Managing Publications and Dat...
IFLA ARL Webinar Series: Digital Preservation - Managing Publications and Dat...
 
Web archiving collaborations: a presentation for colleagues working in the Li...
Web archiving collaborations: a presentation for colleagues working in the Li...Web archiving collaborations: a presentation for colleagues working in the Li...
Web archiving collaborations: a presentation for colleagues working in the Li...
 
Thinking about resource issues: copyright and open access
Thinking about resource issues: copyright and open accessThinking about resource issues: copyright and open access
Thinking about resource issues: copyright and open access
 
Collaborative Web Archiving with Ivy Plus / Borrow Direct
Collaborative Web Archiving with Ivy Plus / Borrow Direct Collaborative Web Archiving with Ivy Plus / Borrow Direct
Collaborative Web Archiving with Ivy Plus / Borrow Direct
 
Digital Library 
Digital Library Digital Library 
Digital Library 
 
CDN Getting Best Value from College Licenses: Jisc MediaHub
CDN Getting  Best Value from College Licenses: Jisc MediaHubCDN Getting  Best Value from College Licenses: Jisc MediaHub
CDN Getting Best Value from College Licenses: Jisc MediaHub
 
1818 societypresentation revised2013
1818 societypresentation revised20131818 societypresentation revised2013
1818 societypresentation revised2013
 
Online Educational Resources for Researcher
Online Educational Resources for ResearcherOnline Educational Resources for Researcher
Online Educational Resources for Researcher
 

Similar to To the Rescue of Scholarly Orphans

An Institutional Perspective to Rescue Scholarly Orphans
An Institutional Perspective to Rescue Scholarly OrphansAn Institutional Perspective to Rescue Scholarly Orphans
An Institutional Perspective to Rescue Scholarly OrphansMartin Klein
 
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...The Frick Collection
 
PRIME: Publisher, Repository & Institutional Metadata Exchange
PRIME: Publisher, Repository & Institutional Metadata ExchangePRIME: Publisher, Repository & Institutional Metadata Exchange
PRIME: Publisher, Repository & Institutional Metadata ExchangeBrian Hole
 
OCAD University Open Research Repository
OCAD University Open Research RepositoryOCAD University Open Research Repository
OCAD University Open Research RepositoryJill Patrick
 
A Vision of the Library’s Role in Archiving Scholarly Artifacts
A Vision of the Library’s Role  in Archiving Scholarly ArtifactsA Vision of the Library’s Role  in Archiving Scholarly Artifacts
A Vision of the Library’s Role in Archiving Scholarly ArtifactsMartin Klein
 
NORFest 2023 Lightning Talks Session Three
NORFest 2023 Lightning Talks Session Three NORFest 2023 Lightning Talks Session Three
NORFest 2023 Lightning Talks Session Three dri_ireland
 
Transforming University Research - Mar 2006
Transforming University Research - Mar 2006Transforming University Research - Mar 2006
Transforming University Research - Mar 2006Jill Patrick
 
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...datascienceiqss
 
A digital library survival kit
A digital library survival kitA digital library survival kit
A digital library survival kitPeter Clarke
 
Introducing PRIME:Publisher, Repository and Institutional Metadata Exchange
Introducing PRIME:Publisher, Repository and Institutional Metadata ExchangeIntroducing PRIME:Publisher, Repository and Institutional Metadata Exchange
Introducing PRIME:Publisher, Repository and Institutional Metadata ExchangeBrian Hole
 
Kate Kelly - Open Research Ireland
Kate Kelly - Open Research IrelandKate Kelly - Open Research Ireland
Kate Kelly - Open Research Irelanddri_ireland
 
The public library and wikipedia
The public library and wikipediaThe public library and wikipedia
The public library and wikipediadorohoward
 
Advocating Open Access: Before, during and after HEFCE
Advocating Open Access: Before, during and after HEFCEAdvocating Open Access: Before, during and after HEFCE
Advocating Open Access: Before, during and after HEFCENick Sheppard
 
A Web-Centric Pipeline for Archiving Scholarly Artifacts
A Web-Centric Pipeline for Archiving Scholarly ArtifactsA Web-Centric Pipeline for Archiving Scholarly Artifacts
A Web-Centric Pipeline for Archiving Scholarly ArtifactsMartin Klein
 
Deconstructed and decentralized scholarly communication
Deconstructed and decentralized scholarly communicationDeconstructed and decentralized scholarly communication
Deconstructed and decentralized scholarly communicationJisc
 
Making the Black Hole Gray: Web Archiving Art Resources at New York Art Resou...
Making the Black Hole Gray: Web Archiving Art Resources at New York Art Resou...Making the Black Hole Gray: Web Archiving Art Resources at New York Art Resou...
Making the Black Hole Gray: Web Archiving Art Resources at New York Art Resou...The Frick Collection
 
Manage it locally to share it globally: RDM and Wikimedia Commons
Manage it locally to share it globally: RDM and Wikimedia CommonsManage it locally to share it globally: RDM and Wikimedia Commons
Manage it locally to share it globally: RDM and Wikimedia CommonsNick Sheppard
 
Dr Natalie Harrower - DRI and Open Data
Dr Natalie Harrower - DRI and Open DataDr Natalie Harrower - DRI and Open Data
Dr Natalie Harrower - DRI and Open Datadri_ireland
 
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...Brian Hole
 

Similar to To the Rescue of Scholarly Orphans (20)

An Institutional Perspective to Rescue Scholarly Orphans
An Institutional Perspective to Rescue Scholarly OrphansAn Institutional Perspective to Rescue Scholarly Orphans
An Institutional Perspective to Rescue Scholarly Orphans
 
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...
 
PRIME: Publisher, Repository & Institutional Metadata Exchange
PRIME: Publisher, Repository & Institutional Metadata ExchangePRIME: Publisher, Repository & Institutional Metadata Exchange
PRIME: Publisher, Repository & Institutional Metadata Exchange
 
OCAD University Open Research Repository
OCAD University Open Research RepositoryOCAD University Open Research Repository
OCAD University Open Research Repository
 
Amenu Kpodo
Amenu KpodoAmenu Kpodo
Amenu Kpodo
 
A Vision of the Library’s Role in Archiving Scholarly Artifacts
A Vision of the Library’s Role  in Archiving Scholarly ArtifactsA Vision of the Library’s Role  in Archiving Scholarly Artifacts
A Vision of the Library’s Role in Archiving Scholarly Artifacts
 
NORFest 2023 Lightning Talks Session Three
NORFest 2023 Lightning Talks Session Three NORFest 2023 Lightning Talks Session Three
NORFest 2023 Lightning Talks Session Three
 
Transforming University Research - Mar 2006
Transforming University Research - Mar 2006Transforming University Research - Mar 2006
Transforming University Research - Mar 2006
 
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
Preservation of Research Data: Dataverse / Archivematica Integration by Allan...
 
A digital library survival kit
A digital library survival kitA digital library survival kit
A digital library survival kit
 
Introducing PRIME:Publisher, Repository and Institutional Metadata Exchange
Introducing PRIME:Publisher, Repository and Institutional Metadata ExchangeIntroducing PRIME:Publisher, Repository and Institutional Metadata Exchange
Introducing PRIME:Publisher, Repository and Institutional Metadata Exchange
 
Kate Kelly - Open Research Ireland
Kate Kelly - Open Research IrelandKate Kelly - Open Research Ireland
Kate Kelly - Open Research Ireland
 
The public library and wikipedia
The public library and wikipediaThe public library and wikipedia
The public library and wikipedia
 
Advocating Open Access: Before, during and after HEFCE
Advocating Open Access: Before, during and after HEFCEAdvocating Open Access: Before, during and after HEFCE
Advocating Open Access: Before, during and after HEFCE
 
A Web-Centric Pipeline for Archiving Scholarly Artifacts
A Web-Centric Pipeline for Archiving Scholarly ArtifactsA Web-Centric Pipeline for Archiving Scholarly Artifacts
A Web-Centric Pipeline for Archiving Scholarly Artifacts
 
Deconstructed and decentralized scholarly communication
Deconstructed and decentralized scholarly communicationDeconstructed and decentralized scholarly communication
Deconstructed and decentralized scholarly communication
 
Making the Black Hole Gray: Web Archiving Art Resources at New York Art Resou...
Making the Black Hole Gray: Web Archiving Art Resources at New York Art Resou...Making the Black Hole Gray: Web Archiving Art Resources at New York Art Resou...
Making the Black Hole Gray: Web Archiving Art Resources at New York Art Resou...
 
Manage it locally to share it globally: RDM and Wikimedia Commons
Manage it locally to share it globally: RDM and Wikimedia CommonsManage it locally to share it globally: RDM and Wikimedia Commons
Manage it locally to share it globally: RDM and Wikimedia Commons
 
Dr Natalie Harrower - DRI and Open Data
Dr Natalie Harrower - DRI and Open DataDr Natalie Harrower - DRI and Open Data
Dr Natalie Harrower - DRI and Open Data
 
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...
The Journal of Open Archaeology Data and PRIME: Incentivising Open Data Archi...
 

More from Herbert Van de Sompel

The web is rotting and what to do about it
The web is rotting and what to do about itThe web is rotting and what to do about it
The web is rotting and what to do about itHerbert Van de Sompel
 
Researcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized WebResearcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized WebHerbert Van de Sompel
 
Persistent Identification: Easier Said than Done
Persistent Identification: Easier Said than DonePersistent Identification: Easier Said than Done
Persistent Identification: Easier Said than DoneHerbert Van de Sompel
 
FAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning IssueFAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning IssueHerbert Van de Sompel
 
Registration / Certification Interoperability Architecture (overlay peer-review)
Registration / Certification Interoperability Architecture (overlay peer-review)Registration / Certification Interoperability Architecture (overlay peer-review)
Registration / Certification Interoperability Architecture (overlay peer-review)Herbert Van de Sompel
 
Collecting the organizational scholarly record
Collecting the organizational scholarly recordCollecting the organizational scholarly record
Collecting the organizational scholarly recordHerbert Van de Sompel
 
Achieving Link Integrity for Managed Collections
Achieving Link Integrity for Managed CollectionsAchieving Link Integrity for Managed Collections
Achieving Link Integrity for Managed CollectionsHerbert Van de Sompel
 
Signposting Overview (Version November 2017)
Signposting Overview (Version November 2017)Signposting Overview (Version November 2017)
Signposting Overview (Version November 2017)Herbert Van de Sompel
 
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDTDBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDTHerbert Van de Sompel
 
Interoperability for web based scholarship
Interoperability for web based scholarshipInteroperability for web based scholarship
Interoperability for web based scholarshipHerbert Van de Sompel
 
A Perspective on Archiving the Scholarly Record
A Perspective on Archiving the Scholarly RecordA Perspective on Archiving the Scholarly Record
A Perspective on Archiving the Scholarly RecordHerbert Van de Sompel
 

More from Herbert Van de Sompel (20)

The web is rotting and what to do about it
The web is rotting and what to do about itThe web is rotting and what to do about it
The web is rotting and what to do about it
 
Researcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized WebResearcher Pod: Scholarly Communication Using the Decentralized Web
Researcher Pod: Scholarly Communication Using the Decentralized Web
 
Persistent Identification: Easier Said than Done
Persistent Identification: Easier Said than DonePersistent Identification: Easier Said than Done
Persistent Identification: Easier Said than Done
 
FAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning IssueFAIR Signposting: A KISS Approach to a Burning Issue
FAIR Signposting: A KISS Approach to a Burning Issue
 
Registration / Certification Interoperability Architecture (overlay peer-review)
Registration / Certification Interoperability Architecture (overlay peer-review)Registration / Certification Interoperability Architecture (overlay peer-review)
Registration / Certification Interoperability Architecture (overlay peer-review)
 
Collecting the organizational scholarly record
Collecting the organizational scholarly recordCollecting the organizational scholarly record
Collecting the organizational scholarly record
 
Almost two decades at LANL
Almost two decades at LANLAlmost two decades at LANL
Almost two decades at LANL
 
Perseverance on Persistence
Perseverance on PersistencePerseverance on Persistence
Perseverance on Persistence
 
Paul Evan Peters Lecture
Paul Evan Peters LecturePaul Evan Peters Lecture
Paul Evan Peters Lecture
 
Achieving Link Integrity for Managed Collections
Achieving Link Integrity for Managed CollectionsAchieving Link Integrity for Managed Collections
Achieving Link Integrity for Managed Collections
 
Signposting Overview (Version November 2017)
Signposting Overview (Version November 2017)Signposting Overview (Version November 2017)
Signposting Overview (Version November 2017)
 
Signposting Overview
Signposting OverviewSignposting Overview
Signposting Overview
 
PID Signposting Pattern
PID Signposting PatternPID Signposting Pattern
PID Signposting Pattern
 
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDTDBpedia Archive using Memento, Triple Pattern Fragments, and HDT
DBpedia Archive using Memento, Triple Pattern Fragments, and HDT
 
Interoperability for web based scholarship
Interoperability for web based scholarshipInteroperability for web based scholarship
Interoperability for web based scholarship
 
Reminiscing about interoperability
Reminiscing about interoperabilityReminiscing about interoperability
Reminiscing about interoperability
 
Creating Pockets of Persistence
Creating Pockets of PersistenceCreating Pockets of Persistence
Creating Pockets of Persistence
 
ResourceSync Quick Overview
ResourceSync Quick OverviewResourceSync Quick Overview
ResourceSync Quick Overview
 
Memento 101
Memento 101Memento 101
Memento 101
 
A Perspective on Archiving the Scholarly Record
A Perspective on Archiving the Scholarly RecordA Perspective on Archiving the Scholarly Record
A Perspective on Archiving the Scholarly Record
 

Recently uploaded

Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxMario
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxmibuzondetrabajo
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxAndrieCagasanAkio
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119APNIC
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa494f574xmv
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书rnrncn29
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书rnrncn29
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxDyna Gilbert
 
ETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxNIMMANAGANTI RAMAKRISHNA
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predieusebiomeyer
 

Recently uploaded (11)

Company Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptxCompany Snapshot Theme for Business by Slidesgo.pptx
Company Snapshot Theme for Business by Slidesgo.pptx
 
Unidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptxUnidad 4 – Redes de ordenadores (en inglés).pptx
Unidad 4 – Redes de ordenadores (en inglés).pptx
 
TRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptxTRENDS Enabling and inhibiting dimensions.pptx
TRENDS Enabling and inhibiting dimensions.pptx
 
IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119IP addressing and IPv6, presented by Paul Wilson at IETF 119
IP addressing and IPv6, presented by Paul Wilson at IETF 119
 
Film cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasaFilm cover research (1).pptxsdasdasdasdasdasa
Film cover research (1).pptxsdasdasdasdasdasa
 
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
『澳洲文凭』买拉筹伯大学毕业证书成绩单办理澳洲LTU文凭学位证书
 
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
『澳洲文凭』买詹姆士库克大学毕业证书成绩单办理澳洲JCU文凭学位证书
 
Top 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptxTop 10 Interactive Website Design Trends in 2024.pptx
Top 10 Interactive Website Design Trends in 2024.pptx
 
ETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptxETHICAL HACKING dddddddddddddddfnandni.pptx
ETHICAL HACKING dddddddddddddddfnandni.pptx
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
SCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is prediSCM Symposium PPT Format Customer loyalty is predi
SCM Symposium PPT Format Customer loyalty is predi
 

To the Rescue of Scholarly Orphans

  • 1. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Herbert Van de Sompel DANS @hvdsomp https://orcid.org/0000-0002-0715-6126 To the Rescue of Scholarly Orphans The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
  • 2. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Scholarly Orphans Team • Los Alamos National Laboratory: • Lyudmila Balakireva • Martin Klein • James Powell • Harihar Shankar • Herbert Van de Sompel • Old Dominion University: • Sawood Alam • Grant Atkins • Shawn Jones • Mat Kelly • Michael L. Nelson
  • 3. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Scholarly Orphans – Project Motivation
  • 4. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 • Consideration • Researchers deposit artifacts in web platforms • Web Platforms: • Dedicated to scholarship: • Commercial: e.g., FigShare, Publons • Not for profit: e.g., OSF, Zenodo • General purpose: • Commercial: e.g., GitHub, SlideShare • Not for profit: e.g., Wikipedia, Wikidata Research and Research Communication on the Web
  • 5. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 • Consideration • Researchers deposit artifacts in web platforms • Status quo - The researchers’ institutions are in the dark • Do not know about the existence of these artifact • Do not have a copy of these artifacts Research and Research Communication on the Web
  • 6. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 • Consideration • Researchers deposit artifacts in web platforms • Status quo – Uncertainty regarding long-term access • Commercial: changing business model, no preservation commitment • Not for profit: unpredictable funding stream Research and Research Communication on the Web
  • 7. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 • Consideration • Researchers deposit artifacts in web platforms • Status quo - Not systematically archived • No frameworks like LOCKSS/Portico exist for these artifacts • Researchers only selectively deposit artifacts in portals that provide archival guarantees; to obtain a cite-able DOI • Can’t expect researchers to (also) upload all artifacts in IRs • Web archives only incidentally archive these artifacts, cf. Hiberlink research Research and Research Communication on the Web Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
  • 8. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Scholarly Orphans – Project Overview How to capture Scholarly Orphans for long-term archiving?
  • 9. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 The Scholarly Orphans Project • Explores an institution-driven paradigm • Academic institutions typically have a long shelf life • A basic premise underlying e.g., LOCKSS, perma.cc • An academic institution should be interested in capturing the artifacts (intellectual property) its scholars deposit on the web • Collecting and archiving such artifacts aligns with the mission of academic libraries
  • 10. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 An Institutional Perspective
  • 11. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 The Scholarly Orphans Project • Explores a paradigm inspired by web archiving • Scale of the problem • Can’t expect researchers to upload all artifacts in an institutional repository • Bilateral agreements for archival purposes with most web portals unlikely
  • 12. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 A Web Archiving Perspective
  • 13. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Scholarly Orphans – Prototype Pipeline Overview
  • 14. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Prototype Pipeline
  • 15. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Prototype Pipeline
  • 16. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Tracking Artifacts
  • 17. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Tracking Artifacts - Description • In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs: • The web identity of the researcher in the portal • Algorithmic discovery • Discovery via a registry • Manual collection • A portal API that supports: • Access by web identity • Access to contributions “since …” for the web identity • Result of tracking: • URI(s) of new artifact(s) discovered in the portal
  • 18. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Tracking Artifacts - Challenges • Portal API access by web identity • Broadly supported by general purpose portals • Typically not supported by scholarly portals • Some lack an API altogether • Should add ORCID access to APIs • OAI-PMH and ResourceSync need sets per web identity
  • 19. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Capturing Artifacts
  • 20. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Capturing Artifacts - Description • The capture process takes as input the URI of a new artifact discovered in a portal • Its task is to create a representative institutional capture of the artifact • Result of capture: • WARC file for new artifact in an institutional archive
  • 21. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Capturing Artifacts - Challenges • Delineate the web boundary of the artifact • More than the input artifact URI • The boundary is in the eye of the beholder • Create a high-fidelity capture using an approach that scales for a steady stream of new artifacts • Unsolved problem • We made a significant breakthrough with the Memento Tracer framework Memento Tracer: http://tracer.mementoweb.org
  • 22. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Capturing Artifacts
  • 23. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Archiving Artifacts
  • 24. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Archiving Artifacts - Description • The archiving process takes as input the URI of a WARC file generated by the capture process • Its task is to ingest the WARC file in a cross-institutional web archive • This can be achieved using off-the-shelf web archiving software, e.g., pywb, Open Wayback • Result of archiving: • Mementos pertaining to newly discovered artifact in a cross- institutional, Memento-compliant web archive • Possibility to link to artifacts using Robust Links: <a href=“URI-A” data-versionurl=“URI-M” data-versiondate=“date-of-capture” Robust Links: http://robustlinks.mementoweb.org/about/
  • 25. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Archiving Artifacts - Challenges • Attempted to use ipwb, a pywb version that uses IPFS • Cross-institutional distributed file system with redundancy • Ran out of time to get it operationally stable Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive https://doi.org/10.1145/2910896.2925467
  • 26. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Demo - myresearch.institute
  • 27. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 myresearch.institute - Researchers • Uniquely identified by ORCIDs • Web identities in multiple portals • Create various types of artifacts
  • 28. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 myresearch.institute - Portals • Tracking started August 27 2018 • Tracking artifacts created starting August 1 2018 • 9000+ artifacts tracked to date for all 16 researchers
  • 29. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 myresearch.institute - Artifacts • schema.org typology: • Answer • Article • BlogPosting • Comment • Dataset • PresentationDigitalDocument • Question • Review • SoftwareSourceCode
  • 30. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Pipeline Demo https://myresearchinstitute.org
  • 31. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Scholarly Orphans – Summary
  • 32. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Summary (1/2) • The Scholarly Orphans project explores an institution-driven approach to capture scholarly artifacts deposited in web portals • Artifacts out of scope of existing archival approaches such as LOCKSS, Portico, web archives • Institutions have a long shelf life, should be interested in collecting these artifacts, and have feasible scale for identity/artifact discovery
  • 33. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Summary (2/2) • Components of the experimental pipeline: • Tracker: Automatically discover artifacts because researchers will not upload them to the institution • Capturer: High fidelity artifact captures through crowd-sourcing navigation patterns with Memento Tracer • Archiver: Cross-institutional, Memento-compliant scholarly web archive
  • 34. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Herbert Van de Sompel DANS @hvdsomp https://orcid.org/0000-0002-0715-6126 To the Rescue of Scholarly Orphans The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
  • 35. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 No Long-Term Access Guarantees “We reserve the right at any time and from time to time to modify or discontinue, temporarily or permanently, the Website (or any part of it) with or without notice. GitHub Terms of Service http://help.github.com/articles/github-terms-of-service https://help.github.com/articles/github-terms-of-service/
  • 36. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Funding Not for Profits FAQ 19: How is arXiv funded and governed? Answer: It is currently supported by funds from a network of member libraries, the Simon’s foundation and financial support, labor, and infrastructure provided by the Cornell University Library. The annual membership fees depend on the institutional usage and range from $1,500 to just $3,000 per year, comparable to the author fees for a single open access article. … nearly 200 libraries in 24 countries, who have made contributions to support arXiv via the membership program …. GitHub Terms of Service http://help.github.com/articles/github-terms-of-service Paul Ginsparg (2017) Preprint Déjà Vu: an FAQ https://arxiv.org/abs/1706.04188
  • 37. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Hiberlink Evidence Web resources referenced in Elsevier corpus (1996-2012) without representative Memento in public web archives
  • 38. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Inspiration • LOCKSS • Web crawling approach • Focused on journal literature • Archive-It • On-demand, subscription-based web archiving • Not focused on scholarly orphans • Institutional repository, auto-discovery of journal articles • Capture an institution’s output • Focused on journal literature • The Locker Project & Amy Guy’s Personal Web Observatory work • Capture an individual’s web presence • Not focused on scholarly orphans http://rhiaro.co.uk/ https://rhiaro.github.io/thesis/
  • 39. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Tracking Artifacts - Implementation • Tracker event notifications: • Linked Data Notifications (JSON-LD) using AS2, PROV-O, schema.org • Identifiers: Unique tracker event identifier per notification • Dates: artifact publication date & artifact tracked date • URIs: 1+ artifact URI • Event database: • Notifications stored/indexed in ElasticSearch • Researcher database: • SQLite
  • 40. Tracking Artifacts - Architecture
  • 41. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Capturing Artifacts - Implementation • Capture event notifications: • Identifiers: Unique capture event identifier per notification ; Preceding tracker event identifier conveyed as provenance • Dates: Datetime of WARC file creation • URIs: 1+ WARC file URI • Tracer, client-side: • Tracer Chrome extension leveraging Selenium IDE • Tracer, server-side: • Stormcrawler ; Selenium (Chrome) with Tracer plug-in ; WarcProxy ; file-system storage for WARC files http://stormcrawler.net/ https://www.seleniumhq.org/projects/webdriver/ https://github.com/odie5533/WarcProxy
  • 42. Capturing Artifacts - Architecture
  • 43. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Archiving Artifacts - Implementation • Archiver event notifications: • Identifiers: Unique archiver event identifier per notification ; preceding tracker/capturer event identifiers conveyed as provenance • Dates: WARC file ingest date ; Memento-Datetime values URIs: 1+ Memento URI, each corresponding to an artifact URI • Web Archive: • pywb • Social card: • MementoEmbed https://github.com/webrecorder/pywb https://github.com/oduwsdl/MementoEmbed
  • 44. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Memento Tracer - Framework http://tracer.mementoweb.org
  • 45. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 Capturing Artifacts - Challenges • Memento Tracer: • Language used to express Traces (interoperability) • Organization of the shared repository for Traces • Limitations of the browser event listener approach for recording Traces • Selection of a Trace for capturing a web publication by other means than URI pattern • Legal constraints
  • 46. Archiving Artifacts - Architecture

Editor's Notes

  1. examples of artifacts: github issues annotations stand-alone peer-review scientific blogs etc
  2. Platforms where we could expect artifacts to be – they are not Selective deposit: Their motivation is typically visibility/cite-ability (DOI), not archiving Researchers don’t do it for the longevity of artifacts
  3. Exploring what this archival layer could be Technical aspects, who runs is etc
  4. University of Porto: 100+ years old University of Coimbra: 700+ years old ETDs as an example to support institution’s and library’s mission Institution in contrast to platform argument Not going for commercial, non-profit platform – institutional solution b/c shelf life
  5. Transition to technical part Effort put into this: 6 months 6 ppl
  6. Pipeline is top 3 Track stuff researchers put in portals on the web Once know, capture it, put in institutional archive Contribute them to cross-institutional archive All coordinated by central process Whenever one component is done, Info stored in event DB DB knows about all artifact metadata
  7. Left: An institution, each institution runs this or has someone run it on their behalf Right: global, one organization or a consortium Even inst. can be powered by cross-institutional tools Tools shared beyond institution
  8. Orchestrator sends message to tracker on regular basis ”For this user track this portal” Tracker sends back message with info about new artifact
  9. Incremental polling of artifacts by users
  10. We don’t have this info, ORCID would be logical to have it – gap in infrastructure If researchers would use FOAF info, could use it from there Manual process in institution to collect Content drift, missing out of info, artifact has changed since Pushed by portals, subscribe API restrictions
  11. Mention Frank Shipman study
  12. We already have an institutional archive, WARC archive Also create an cross-inst web archive
  13. Currently central WA component Storage could be decentralized via IPFS
  14. Fill out N
  15. Pipeline is top 3 Track stuff researchers put in portals on the web Once know, capture it, put in institutional archive Contribute them to cross-institutional archive All coordinated by central process Whenever one component is done, Info stored in event DB DB knows about all artifact metadata
  16. - As opposed to global central portal thinking Bring down the scale by institutional perspective 4k researchers manageable, 5m not so much, need to be google
  17. ~100k articles with links > 230k links total
  18. Similarities inspire us Different from what we are doing IR auto discovery is similar , connection point More about locker and AG Discover and capture personal contributions in social platforms Focus on individual Pictures in flickr
  19. All events are sent as LDN, all use AS2, prov, schema as vocabulary Crucial pieces of info in notification: IDs, dates, URIs Specific for tracker: this ID, this date, these URIs
  20. - Researcher DB contains ORCID of researchers as primary key and corresponding web portal IDs - Tracker DB contains portal info
  21. New paradigm for web archiving, found as part of this problem Unexpected, yet most important result/contribution of this effort Lets imagine you need to frequently archive slide decks from SlideShare (we do) Understand that there are boundary and quality problems Bring human (curator) in the loop Navigate to *one* SS presentation Interact with that presentation in an attempt to show what the boundary is, make explicit what needs to be archived Browser extension, listens to browser events, intercepts them and records them in abstract way (not in terms of URLs, addresses in the DOM, Xpath, CSS selectors) Result: trace expresses in abstract way the interactions the curator had with slide deck Abstract b/c same info how to interact with *this* presentation will apply to *all* presentations Record one, share, re-use with headless browser Share in repo, collectively create, curate traces, update with layout of pages
  22. Canvas, google maps
  23. Fetch WARC Archive can receive messages from multiple institutions archiver event message back to institutional orchestrator