Mind the gap!
Reflections on the state of
repository data harvesting
Simeon Warner (Cornell University)
http://orcid.org/0000-0002-7970-7855
Long long ago,
when XML was hard,
Unicode was merely one
possible character set,
a big hard drive was 10GB,
and HotBot & AltaVista
had a new competitor...
... it was 1999 and the UPS meeting in
Santa Fe aimed to
“... identify technologies to stimulate
the adoption of the concept of [Open
Access] author self-archived systems in
scholarly communication; theorize a
framework for the integration of e-
print services in the academic
document system ...”
https://www.openarchives.org/meetings/SantaFe1999/ups-invitation-ori.htm
Thus was born
OAI-PMH
v1.0 2001, v1.1 2002, v2.0 2003
OAI-PMH was great!
•  It works
•  Scales to millions of items
•  Easy to implement (good s/w libraries)
•  XML, which brought UTF-8 (hurrah!)
•  Widely deployed, stable since 2003 (v2.0)
•  Registries & validators
•  Community & documentation
BASE harvests
>5000 sources
>112M documents
BUT...
•  Not RESTful
•  Repository-centric
•  XML metadata only
•  Metadata is wrapped
•  Dynamic set membership bug
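Two of the points above ("Not RESTful" and "Metadata is wrapped") can be illustrated with a minimal sketch. It assumes a hypothetical repository at https://repo.example.org/oai; the sample response is hand-written for illustration, and the function names are not from any particular library. Every request goes to one base URL with a verb parameter, and the Dublin Core record has to be dug out of the OAI-PMH envelope.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"
NS = {"oai": OAI_NS, "dc": DC_NS}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request: one base URL, the action is a query parameter."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hand-written sample response (hypothetical repository and record).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2017-06-28T00:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">https://repo.example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repo.example.org:1</identifier>
        <datestamp>2017-06-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Unwrap (identifier, title) pairs and the resumption token from the envelope."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        title = rec.findtext("oai:metadata//dc:title", namespaces=NS)
        records.append((identifier, title))
    token = root.findtext("oai:ListRecords/oai:resumptionToken", namespaces=NS)
    return records, (token or None)


if __name__ == "__main__":
    print(list_records_url("https://repo.example.org/oai"))
    print(parse_list_records(SAMPLE_RESPONSE))
```

A harvester would repeat the request with each returned resumptionToken until none is returned.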
"Currently, OAI-PMH is the only
behavior that is uniformly exposed by
most repositories.
[But], its focus on metadata, its pull-
based paradigm, and its technological
roots that date back to the web of the
nineties put it at odds with ... current
web technologies."
COAR Next Generation Repositories
http://comment.coar-repositories.org/2-next-generation-repositories/
Photo by drivethrucafe CC BY-SA
https://www.flickr.com/photos/128758398@N07/15836296662
Google Scholar
is great, but
not the answer
Replacement with no gap
New approach must:
•  Meet existing OAI-PMH use cases
•  Support content as well as metadata
•  Scale better
•  Follow web standards
•  Be modern, developer friendly
Push-me pull-you
many items / sources
low latency / efficiency
=> push/notification
modest size
low barrier
=> pull
Conclusion v1
We, the repository
community, need to
discuss and agree on
a new approach to
harvesting
ResourceSync
ANSI/NISO Z39.99-2017
Sitemaps +
•  multiple sets
•  fixity
•  links
•  changes only
•  dumps
+ Notifications (Push)
WebSub (formerly PubSubHubbub)
•  low latency
•  efficiency
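As a rough illustration of "Sitemaps + fixity", the sketch below builds a small ResourceSync resource list: an ordinary sitemap urlset plus rs:md elements carrying the capability and a per-resource hash and length. The repository host, timestamps and function names are assumptions made for the example; the namespace URIs and attribute names follow the ResourceSync specification as I understand it.

```python
import hashlib
import xml.etree.ElementTree as ET

SM_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"
ET.register_namespace("", SM_NS)
ET.register_namespace("rs", RS_NS)


def build_resource_list(resources, at):
    """Serialize a resource list.

    resources: iterable of (uri, lastmod, content_bytes) tuples (hypothetical input shape).
    at: timestamp at which the list was generated.
    """
    urlset = ET.Element(f"{{{SM_NS}}}urlset")
    ET.SubElement(urlset, f"{{{RS_NS}}}md", {"capability": "resourcelist", "at": at})
    for uri, lastmod, content in resources:
        url = ET.SubElement(urlset, f"{{{SM_NS}}}url")
        ET.SubElement(url, f"{{{SM_NS}}}loc").text = uri
        ET.SubElement(url, f"{{{SM_NS}}}lastmod").text = lastmod
        # Fixity: the harvester can check what it downloaded against these values.
        ET.SubElement(url, f"{{{RS_NS}}}md", {
            "hash": "md5:" + hashlib.md5(content).hexdigest(),
            "length": str(len(content)),
        })
    return ET.tostring(urlset, encoding="unicode")


def fixity_ok(content, declared_hash):
    """Check downloaded bytes against a declared 'md5:<hex>' value."""
    algorithm, _, digest = declared_hash.partition(":")
    if algorithm != "md5":
        raise ValueError("this sketch only handles md5 hashes")
    return hashlib.md5(content).hexdigest() == digest


if __name__ == "__main__":
    print(build_resource_list(
        [("https://repo.example.org/item/1", "2017-06-01T00:00:00Z", b"hello")],
        at="2017-06-28T00:00:00Z",
    ))
```

Because the result is still a valid sitemap, a consumer that knows nothing about ResourceSync can read the loc and lastmod values and ignore the rest.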
CORE
>6000 journals
>2400 repositories
>77M articles
(>6M full text)
metadata + content
Slide from Petr Knoth / CORE – DPLAfest 2017 presentation -- https://goo.gl/vz3zuJ
Tested with
resync client. 20
x 25MB sitemaps,
1M items ✔
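The figures above are consistent with the sitemap protocol's limit of 50,000 URLs per file (1,000,000 / 50,000 = 20 files, and 25 MB / 50,000 is roughly 500 bytes per entry), although the slide does not say how the split was chosen. A minimal sketch of that partitioning, with hypothetical file names and item URIs:

```python
from itertools import islice

MAX_URLS_PER_SITEMAP = 50_000  # per-file limit in the sitemaps.org protocol


def chunked(iterable, size=MAX_URLS_PER_SITEMAP):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def plan_sitemaps(uris, base="https://repo.example.org/resourcelist"):
    """Map each component list's (hypothetical) URI to the item URIs it would hold."""
    parts = {}
    for number, chunk in enumerate(chunked(uris), start=1):
        parts[f"{base}-{number:05d}.xml"] = chunk
    return parts


if __name__ == "__main__":
    items = (f"https://repo.example.org/item/{i}" for i in range(1_000_000))
    plan = plan_sitemaps(items)
    print(len(plan), "component lists")
```

The component lists would then be referenced from a single index document so that a client has one entry point.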
IIIF & Europeana
•  500,000,000+ IIIF resources – how to
find them?
•  JSON-LD documents and related web
pages
•  Europeana experiments with NLW and
UCD
o  ResourceSync, Sitemaps and native
structures
Hyku & DPLA
•  Extension of Hydra (now Samvera) codebase
to provide in-the-box repository
•  Native ResourceSync support
o  Both resource lists and change lists
•  Successful harvesting tests with DPLA
o  Desire for resource dumps and change
dumps for efficiency
(see new report:
http://hydrainabox.projecthydra.org/2017/06/22/resourcesync.html )
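To show why change lists matter for incremental harvesting, here is a minimal sketch of a consumer applying one. The sample document is hand-written for a hypothetical repository, and the helper names are illustrative; it assumes each entry carries an rs:md element whose change attribute is created, updated or deleted.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hand-written sample change list (hypothetical repository).
SAMPLE_CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2017-06-01T00:00:00Z"/>
  <url>
    <loc>https://repo.example.org/item/1</loc>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>https://repo.example.org/item/2</loc>
    <rs:md change="deleted"/>
  </url>
  <url>
    <loc>https://repo.example.org/item/3</loc>
    <rs:md change="created"/>
  </url>
</urlset>"""


def parse_change_list(xml_text):
    """Return (uri, change) pairs in document order."""
    root = ET.fromstring(xml_text)
    changes = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        change = url.find("rs:md", NS).get("change")
        changes.append((loc, change))
    return changes


def apply_changes(local_uris, changes):
    """Update the set of locally held URIs in place; return the URIs to (re)fetch."""
    to_fetch = []
    for loc, change in changes:
        if change == "deleted":
            local_uris.discard(loc)
        else:  # created or updated
            local_uris.add(loc)
            to_fetch.append(loc)
    return to_fetch


if __name__ == "__main__":
    held = {"https://repo.example.org/item/1", "https://repo.example.org/item/2"}
    print(apply_changes(held, parse_change_list(SAMPLE_CHANGE_LIST)))
    print(sorted(held))
```

Only two resources are fetched here instead of re-reading the whole resource list. A change dump would go one step further by packaging the changed content itself, which is the efficiency gain mentioned above.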
Conclusion v2
We, the repository
community, should
agree on & transition to
ResourceSync as the
new approach to
harvesting
Repository prescription
•  Metadata and content should be web
resources
o  stable URIs, follow web standards, not hidden
behind query interfaces
•  Support ResourceSync as the primary
harvesting interface
o  OAI-PMH as secondary where necessary
•  Distinguish and relate metadata and content
entries
That’s
all
folks
@zimeon
simeon.warner@cornell.edu

Unraveling Multimodality with Large Language Models.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 

Mind the gap! Reflections on the state of repository data harvesting

  • 1. Mind the gap! Reflections on the state of repository data harvesting Simeon Warner (Cornell University) http://orcid.org/0000-0002-7970-7855
  • 2. Long long ago, when XML was hard, Unicode was merely one possible character set, a big hard drive was 10GB, and HotBot & AltaVista had a new competitor...
  • 3. ... it was 1999 and the UPS meeting in Santa Fe aimed to “... identify technologies to stimulate the adoption of the concept of [Open Access] author self-archived systems in scholarly communication; theorize a framework for the integration of e-print services in the academic document system ...” https://www.openarchives.org/meetings/SantaFe1999/ups-invitation-ori.htm
  • 4. Thus was born OAI-PMH v1.0 2001, v1.1 2002, v2.0 2003
  • 5. OAI-PMH was great! •  It works •  Scales to millions of items •  Easy to implement (good s/w libraries) •  XML, which brought UTF-8 (hurrah!) •  Widely deployed, stable since 2003 (v2.0) •  Registries & validators •  Community & documentation
  • 7.
  • 8. BUT... •  Not RESTful •  Repository-centric •  XML metadata only •  Metadata is wrapped •  Dynamic set membership bug
  • 9. "Currently, OAI-PMH is the only behavior that is uniformly exposed by most repositories. [But], its focus on metadata, its pull-based paradigm, and its technological roots that date back to the web of the nineties put it at odds with ... current web technologies." COAR Next Generation Repositories http://comment.coar-repositories.org/2-next-generation-repositories/
  • 10. Photo by drivethrucafe CC BY-SA https://www.flickr.com/photos/128758398@N07/15836296662
  • 11. Google Scholar is great, but not the answer
  • 12. Replacement with no gap New approach must: •  Meet existing OAI-PMH use cases •  Support content as well as metadata •  Scale better •  Follow web standards •  Be modern, developer friendly
  • 13. Push-me pull-you many items / sources low latency / efficiency => push/notification modest size low barrier => pull
  • 14. Conclusion v1 We, the repository community, need to discuss and agree on a new approach to harvesting
  • 15. ResourceSync ANSI/NISO Z39.99-2017 Sitemaps + •  multiple sets •  fixity •  links •  changes only •  dumps
  • 16. + Notifications (Push) PubSubHubbub (now WebSub) •  low latency •  efficiency
  • 17. CORE >6000 journals >2400 repositories >77M articles (>6M full text) metadata + content
  • 18. Slide from Petr Knoth / CORE – DPLAfest 2017 presentation -- https://goo.gl/vz3zuJ Tested with resync client. 20 x 25MB sitemaps, 1M items ✔
  • 19. IIIF & Europeana •  500,000,000+ IIIF resources – how to find them? •  JSON-LD documents and related web pages •  Europeana experiments with NLW and UCD o  ResourceSync, Sitemaps and native structures
  • 20. Hyku & DPLA •  Extension of Hydra (now Samvera) codebase to provide in-the-box repository •  Native ResourceSync support o  Both resource lists and change lists •  Successful harvesting tests with DPLA o  Desire for resource dumps and change dumps for efficiency (see new report: http://hydrainabox.projecthydra.org/2017/06/22/resourcesync.html )
  • 21. Conclusion v2 We, the repository community, should agree on & transition to ResourceSync as the new approach to harvesting
  • 22. Repository prescription •  Metadata and content should be web resources o  stable URIs, follow web standards, not hidden behind query interfaces •  Support ResourceSync as the primary harvesting interface o  OAI-PMH as secondary where necessary •  Distinguish and relate metadata and content entries
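To make the "Sitemaps + fixity + links" idea from the ResourceSync slide, and the "distinguish and relate metadata and content" prescription, a little more concrete, here is a minimal sketch of reading a ResourceSync Resource List with the Python standard library. The sample document, the example.org URIs, the hash value and the function name parse_resource_list are all made up for illustration; a real harvester would fetch the list over HTTP and would more likely use an existing client such as the resync client mentioned above.

```python
import xml.etree.ElementTree as ET

# Namespaces used by Sitemaps and by the ResourceSync extensions.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical Resource List: one content file and one metadata record,
# related to each other with rs:ln links (all values are invented).
SAMPLE_RESOURCE_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2017-06-01T00:00:00Z"/>
  <url>
    <loc>http://example.org/content/1.pdf</loc>
    <lastmod>2017-05-30T12:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
           length="8876" type="application/pdf"/>
    <rs:ln rel="describedby" href="http://example.org/metadata/1.xml"/>
  </url>
  <url>
    <loc>http://example.org/metadata/1.xml</loc>
    <lastmod>2017-05-30T12:00:00Z</lastmod>
    <rs:md length="1200" type="application/xml"/>
    <rs:ln rel="describes" href="http://example.org/content/1.pdf"/>
  </url>
</urlset>
"""


def parse_resource_list(xml_text):
    """Return (capability, resources) from a Resource List document."""
    root = ET.fromstring(xml_text)

    # The top-level rs:md element says which kind of document this is.
    list_md = root.find("rs:md", NS)
    capability = list_md.get("capability") if list_md is not None else None

    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        length = md.get("length") if md is not None else None
        resources.append({
            "uri": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            # Fixity information, if the source provides it.
            "hash": md.get("hash") if md is not None else None,
            "length": int(length) if length else None,
            # Links such as describedby/describes relate content and metadata.
            "links": {ln.get("rel"): ln.get("href")
                      for ln in url.findall("rs:ln", NS)},
        })
    return capability, resources


if __name__ == "__main__":
    capability, resources = parse_resource_list(SAMPLE_RESOURCE_LIST)
    print(capability)
    for res in resources:
        print(res["uri"], res["hash"], res["links"])
```

Because each entry is an ordinary web resource with its own URI, the same loop serves metadata and content alike; the rel links are what let a harvester pair the two.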