Data interoperability toolkit (OpenMinTeD)
1. 1
twitter.com/openminted_eu
Presenter: Petr Knoth
Data interoperability toolkit
OpenMinTeD Final Review
Task 5.5
2. 2
Task 5.5
Objective
Provide a seamless layer enabling the
ingestion and synchronisation of open
access research literature to the
OpenMinTeD platform.
7. 7
Source type                           Details                              Number of open access articles
------------------------------------  -----------------------------------  ------------------------------
Repositories and full OA publishers   3,667 data sources globally          9,033,808
(OpenAIRE and CORE)                   harvested using OAI-PMH
CORE Publisher Connector              Elsevier                             1,191,785
                                      Springer                             540,889
                                      Frontiers                            65,927
                                      PLoS                                 179,571
Total publisher connector                                                  1,978,172
Total Dataset                                                              11,011,980
Knoth, P., Anastasiou, L., Pearce, S. and Pontika, M. (2018) Towards a Global Comprehensive Dataset of Open Access
Papers for Text Analytics, Open Repositories 2018, Bozeman, Montana
Task 5.5
Dataset statistics
as of Jan 2018
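The totals in the table can be checked with a few lines of arithmetic. The sketch below only re-adds the figures reported on the slide; the variable names are illustrative and not part of the project.

```python
# Article counts as reported on the slide (as of Jan 2018).
repositories_and_oa_publishers = 9_033_808

publisher_connector = {
    "Elsevier": 1_191_785,
    "Springer": 540_889,
    "Frontiers": 65_927,
    "PLoS": 179_571,
}

# Sum of the four publisher rows, then the grand total.
connector_total = sum(publisher_connector.values())
dataset_total = repositories_and_oa_publishers + connector_total

print(f"{connector_total:,}")  # 1,978,172
print(f"{dataset_total:,}")    # 11,011,980
```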
8. 8
OpenMinTeD consortium plenary Lausanne
1. A dataset of 11 million+ open access full texts, i.e. several times larger than any other existing legally downloadable set of Open Access (OA) papers, such as the PubMed OA subset and arXiv.org
2. First solution for a large-scale aggregation of hybrid-Gold OA papers from non-standardised systems of key publishers.
3. First implementation and application of ResourceSync (Haslhofer et al., 2013) that scales to millions of items.
Task 5.5
Highlights
9. 9
• What is a corpus of scientific publications?
  • A set of identifiers (hashes calculated from the publications' content, with links to metadata) expressed in OMTD-SHARE
  • Corpora are guaranteed to be persistent
• How are corpora created in the registry?
  • Federated search over publications in CORE/OpenAIRE
  • Results deduplicated based on document hashes exposed by their APIs (extension of OMTD-SHARE)
  • Lazy evaluation on corpus creation
• Where are the resources stored?
  • In a distributed object storage system
• How are content resources accessible?
  • GET/PUT interface
  • Publication: key is the hash
  • Metadata: key is a generated filename <source>-<sourceID>-timestamp.xml
Task 5.5
Key technical decisions 1/2
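The storage decisions on this slide can be illustrated with a minimal sketch of a content-addressed store. Everything here is an assumption made for illustration: the slides do not name the hash function (SHA-256 is used below), the timestamp format in the metadata filename is not specified, and `ObjectStore`, `content_hash`, `metadata_key` and `store_publication` are hypothetical names, not the project's API. An in-memory dictionary stands in for the distributed object storage system.

```python
import hashlib
from datetime import datetime, timezone


class ObjectStore:
    """In-memory stand-in for an object store with a GET/PUT interface."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


def content_hash(content: bytes) -> str:
    # Assumption: SHA-256. The slides only say "hash of the content".
    return hashlib.sha256(content).hexdigest()


def metadata_key(source: str, source_id: str, when: datetime) -> str:
    # Follows the <source>-<sourceID>-timestamp.xml pattern from the slide;
    # the concrete timestamp format is an assumption.
    return f"{source}-{source_id}-{when.strftime('%Y%m%dT%H%M%SZ')}.xml"


def store_publication(store, content, metadata_xml, source, source_id, when):
    """Store the full text under its hash and the metadata under a filename key."""
    pub_key = content_hash(content)
    store.put(pub_key, content)  # identical content always maps to the same key
    md_key = metadata_key(source, source_id, when)
    store.put(md_key, metadata_xml)
    return pub_key, md_key


store = ObjectStore()
when = datetime(2018, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
full_text = b"example full text of a publication"

# The same publication arrives from two sources with different metadata records.
k1, m1 = store_publication(store, full_text, b"<record/>", "core", "12345", when)
k2, m2 = store_publication(store, full_text, b"<record/>", "openaire", "abc", when)

print(k1 == k2)    # True: one stored copy of the full text
print(len(store))  # 3: one publication object plus two metadata records
```

Because the publication key is derived from the content, storing the same file twice overwrites the same object rather than creating a second copy, while each source keeps its own metadata record.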
10. 10
• How is reproducibility achieved?
  • Once a corpus is created, its data stay in the document storage permanently
• How is it ensured that the same files are not stored multiple times?
  • Ensured by the hashing mechanism
• How do we ensure that a new corpus does not contain duplicate resources from CORE/OpenAIRE?
  • The CORE and OpenAIRE APIs both apply the same hashing function to content (extension of OMTD-SHARE)
  • Results are deduplicated in the registry
Task 5.5
Key technical decisions 2/2
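Hash-based deduplication of federated results might look roughly like the sketch below. The record layout (a dictionary with `hash` and `source` fields) and the function name `deduplicate` are assumptions for illustration; the actual API responses and registry code are not shown in the slides. A generator is used so that records are only examined as the corpus is consumed, which is one possible reading of lazy evaluation rather than a description of the real implementation.

```python
from itertools import chain


def deduplicate(*result_streams):
    """Lazily yield one record per content hash across several result streams."""
    seen = set()
    for record in chain(*result_streams):
        digest = record["hash"]
        if digest not in seen:
            seen.add(digest)
            yield record


# Hypothetical search results; both sources apply the same hashing function,
# so the same document carries the same hash in both lists.
core_results = [
    {"hash": "a1", "source": "CORE"},
    {"hash": "b2", "source": "CORE"},
]
openaire_results = [
    {"hash": "b2", "source": "OpenAIRE"},
    {"hash": "c3", "source": "OpenAIRE"},
]

corpus = [r["hash"] for r in deduplicate(core_results, openaire_results)]
print(corpus)  # ['a1', 'b2', 'c3']
```

The approach only works because both aggregators hash content with the same function; with different functions, the same document would appear under two distinct identifiers.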
11. 11
Task 5.5
Conclusions
• T5.5 fully completed all work set by the DoW.
• All three key components in production
• First scalable implementation of ResourceSync
• World’s largest set of OA documents (e.g. larger than arXiv and PubMed OA) assembled from publishers.
• Feedback of reviewers addressed and integrated
• Future work:
  • Continue adding more publishers, testing and maintaining the service.
  • A lot of interest in the connector
  • Sustainability of the connector beyond the project lifetime.
Editor's Notes
Achieving interoperability across publishers at the level of files (the publisher connector intentionally neither parses nor interprets the different metadata formats of publishers; these are interpreted only by aggregators like CORE/OpenAIRE)
Lack of an adopted common API approach for harvesting across publishers (e.g. like OAI-PMH across repositories)
Different mechanisms for flagging OA content
Consistent provision of full text links in metadata (including in CrossRef TDM)
Lack of support for discovery of new content
Technical (and also legal) issues around systematic full text aggregation from publishers
Full text harvesting/crawling limits in place on publisher endpoints
Lack of documentation on publisher systems
Reasons to adopt ResourceSync for this task:
- Very large dataset with an ongoing stream of content. OAI-PMH fails in these situations. Updates need to be properly addressed and synchronised quickly.
- Enable CORE/OpenAIRE to ingest content via ResourceSync, so that CORE/OpenAIRE can also encourage repositories to start replacing their old OAI-PMH ingestion mechanisms with more efficient ResourceSync mechanisms.
To achieve the desired functionality, we need to:
- Develop a webserver on top of the ResourceSync implementation developed at DANS
- Adapt the logic for the generation of ChangeLists so that changes don’t have to be detected, but are fed directly from the Publisher Connector ingestion mechanisms
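As an illustration of feeding changes directly into a ChangeList, the sketch below builds a ResourceSync ChangeList document from a list of ingestion events using only the Python standard library. It is not the DANS implementation: `build_change_list`, the event dictionaries and the example.org URLs are hypothetical. The XML structure (a sitemap `urlset` with `rs:md` elements carrying `capability="changelist"` and per-resource `change` values) follows the ResourceSync specification.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"

# Serialise with the conventional prefixes (default namespace and "rs").
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("rs", RS_NS)


def build_change_list(events, from_time):
    """Build a ChangeList from ingestion events, with no change detection step."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    ET.SubElement(
        urlset,
        f"{{{RS_NS}}}md",
        {"capability": "changelist", "from": from_time},
    )
    for event in events:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = event["uri"]
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = event["time"]
        # change is one of "created", "updated" or "deleted"
        ET.SubElement(url, f"{{{RS_NS}}}md", {"change": event["change"]})
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical events as they might be emitted by an ingestion pipeline.
events = [
    {"uri": "http://example.org/files/a1.pdf", "time": "2018-01-15T12:00:00Z", "change": "created"},
    {"uri": "http://example.org/files/b2.pdf", "time": "2018-01-15T12:05:00Z", "change": "updated"},
]

xml_doc = build_change_list(events, from_time="2018-01-15T00:00:00Z")
print(xml_doc)
```

Since the ingestion process already knows which resources it created or updated, the list can be written as events occur instead of being derived by comparing snapshots of the dataset.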