
Data interoperability toolkit (OpenMinTeD)



Provide a seamless layer enabling the ingestion and synchronisation of open access research literature to the OpenMinTeD platform.

Published in: Science


  1. Data interoperability toolkit. OpenMinTeD Final Review, Task 5.5. Presenter: Petr Knoth. twitter.com/openminted_eu
  2. Task 5.5 Objective: Provide a seamless layer enabling the ingestion and synchronisation of open access research literature to the OpenMinTeD platform.
  3. Task 5.5 Overview: 1. Harvesting of metadata and content from repositories
  4. Task 5.5 Overview: 2. Harvesting of hybrid open access content from non-standard providers
  5. Task 5.5 Overview: 3. Providing a seamless layer on top of open access content using ResourceSync
  6. Task 5.5 Overview: 4. Connectors (CORE and OpenAIRE) to the registry via OMTD-SHARE
  7. Task 5.5 Dataset statistics as of Jan 2018 (number of open access articles by source type)
     • Repositories and full OA publishers (OpenAIRE and CORE), 3,667 data sources globally, harvested using OAI-PMH: 9,033,808
     • CORE Publisher Connector:
         • Elsevier: 1,191,785
         • Springer: 540,889
         • Frontiers: 65,927
         • PLoS: 179,571
         • Total publisher connector: 1,978,172
     • Total dataset: 11,011,980
     Source: Knoth, P., Anastasiou, L., Pearce, S. and Pontika, M. (2018) Towards a Global Comprehensive Dataset of Open Access Papers for Text Analytics, Open Repositories 2018, Bozeman, Montana.
  8. Task 5.5 Highlights (OpenMinTeD consortium plenary, Lausanne)
     1. A dataset of 11 million+ open access full texts, i.e. multiple times larger than any other existing legally downloadable set of Open Access (OA) papers, such as the PubMed OA subset and arXiv.org.
     2. First solution for large-scale aggregation of hybrid Gold OA papers from the non-standardised systems of key publishers.
     3. First implementation and application of ResourceSync (Haslhofer et al., 2013) that scales to millions of items.
  9. Task 5.5 Key technical decisions 1/2
     • What is a corpus of scientific publications?
         • A set of identifiers (hashes calculated from the publications' content, with links to metadata) expressed in OMTD-SHARE
         • Corpora are guaranteed to be persistent
     • How are corpora created in the registry?
         • Federated search over publications in CORE/OpenAIRE
         • Results deduplicated based on document hashes exposed by their APIs (extension of OMTD-SHARE)
         • Lazy evaluation on corpus creation
     • Where are the resources stored?
         • In a distributed object storage system
     • How are content resources accessible?
         • GET/PUT interface
         • Publication: the key is the hash
         • Metadata: the key is a generated filename, <source>-<sourceID>-timestamp.xml
  10. Task 5.5 Key technical decisions 2/2
     • How is reproducibility achieved?
         • Once a corpus is created, its data stay in the document storage forever
     • How is it ensured that the same files are not stored many times?
         • By the hashing mechanism
     • How do we ensure that a new corpus does not contain duplicate resources from CORE/OpenAIRE?
         • The CORE and OpenAIRE APIs both apply the same hashing function to content (extension of OMTD-SHARE)
         • Results are deduplicated in the registry
  11. Task 5.5 Conclusions
     • T5.5 fully completed all work set by the DoW.
     • All three key components are in production
     • First scalable implementation of ResourceSync
     • World's largest set of OA documents (e.g. more than arXiv and the PubMed OA subset) assembled from publishers.
     • Reviewers' feedback addressed and integrated
     • Future work:
         • Continue adding more publishers, testing and maintaining the service.
         • A lot of interest in the connector
         • Sustainability of the connector beyond the project lifetime.
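
The key technical decisions on slides 9 and 10 (publications stored under their content hash, metadata stored under a generated filename, and results from CORE and OpenAIRE deduplicated by hash) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the slides do not name the hash function (SHA-256 is used here), and `ObjectStore`, `content_hash`, `metadata_key` and `build_corpus` are hypothetical names, not part of the OpenMinTeD, CORE or OpenAIRE software.

```python
import hashlib


def content_hash(data: bytes) -> str:
    # Assumption: SHA-256. The slides only say that CORE and OpenAIRE
    # apply the same hashing function to publication content.
    return hashlib.sha256(data).hexdigest()


def metadata_key(source: str, source_id: str, timestamp: str) -> str:
    # Mirrors the filename pattern on slide 9: <source>-<sourceID>-timestamp.xml
    return f"{source}-{source_id}-{timestamp}.xml"


class ObjectStore:
    """Hypothetical in-memory stand-in for the distributed object storage."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, value: bytes) -> bool:
        # Write-once: an existing key is never overwritten, so the same
        # file is stored only once and stored data never changes.
        if key in self._objects:
            return False
        self._objects[key] = value
        return True

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def __len__(self) -> int:
        return len(self._objects)


def build_corpus(store: ObjectStore, results) -> list:
    """Build a corpus as an ordered list of unique content hashes.

    `results` is an iterable of (source, source_id, content) tuples,
    standing in for federated search results from several providers.
    """
    corpus = []
    seen = set()
    for _source, _source_id, content in results:
        key = content_hash(content)
        store.put(key, content)  # no-op if this content is already stored
        if key not in seen:      # deduplicate within the corpus
            seen.add(key)
            corpus.append(key)
    return corpus


if __name__ == "__main__":
    store = ObjectStore()
    hits = [
        ("core", "123", b"paper A"),
        ("openaire", "abc", b"paper A"),  # same content from a second source
        ("core", "456", b"paper B"),
    ]
    corpus = build_corpus(store, hits)
    print(len(corpus), "unique publications in the corpus")
    print(metadata_key("core", "123", "20180101"))
```

Because a corpus in this sketch is only a list of hashes and the store is write-once, re-reading the corpus later returns the same bytes, which is the reproducibility property the slides describe.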