Data interoperability toolkit (OpenMinTeD)


Provide a seamless layer enabling the ingestion and synchronisation of open access research literature to the OpenMinTeD platform.

1
twitter.com/openminted_eu
Presenter: Petr Knoth
Data interoperability toolkit
OpenMinTeD Final Review
Task 5.5

2
Task 5.5
Objective
Provide a seamless layer enabling the
ingestion and synchronisation of open
access research literature to the
OpenMinTeD platform.

3
Task 5.5
Overview
1. Harvesting of metadata and
content from repositories
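
The dataset statistics slide later in this deck states that repositories are harvested using OAI-PMH. As a minimal sketch of what one such harvesting request looks like, the helper below builds a `ListRecords` URL. The function name and the example endpoint are hypothetical and do not come from the CORE or OpenAIRE harvesters; only the `verb`, `metadataPrefix`, and `resumptionToken` arguments are part of the OAI-PMH protocol.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "oai_dc",
                     resumption_token: Optional[str] = None) -> str:
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    if resumption_token is not None:
        # In OAI-PMH, resumptionToken is an exclusive argument: it must not
        # be combined with metadataPrefix, from, until, or set.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint, used only for illustration.
first_page = list_records_url("https://repo.example.org/oai")
next_page = list_records_url("https://repo.example.org/oai",
                             resumption_token="abc/123")
```

A harvester would request `first_page`, read the `resumptionToken` from the response, and keep requesting follow-up pages until the repository returns an empty token.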

4
Task 5.5
Overview
2. Harvesting of hybrid open access
content from non-standard providers

5
Task 5.5
Overview
3. Providing a seamless layer on top
of open access content using
ResourceSync

6
Task 5.5
Overview
4. Connectors (CORE and
OpenAIRE) to the registry via
OMTD-SHARE

7
Task 5.5
Dataset statistics as of Jan 2018

Source type                       Details                       Number of open access articles
--------------------------------  ----------------------------  ------------------------------
Repositories and full OA          3,667 data sources globally   9,033,808
publishers (OpenAIRE and CORE)    harvested using OAI-PMH
CORE Publisher Connector          Elsevier                      1,191,785
                                  Springer                      540,889
                                  Frontiers                     65,927
                                  PLoS                          179,571
Total publisher connector                                       1,978,172
Total dataset                                                   11,011,980

Knoth, P., Anastasiou, L., Pearce, S. and Pontika, M. (2018) Towards a Global Comprehensive Dataset of Open Access Papers for Text Analytics, Open Repositories 2018, Bozeman, Montana
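
The totals reported on this slide can be checked with a few lines of arithmetic. The figures below are copied from the slide; the variable names are illustrative only.

```python
# Article counts per publisher, as reported on the slide (Jan 2018).
publisher_connector = {
    "Elsevier": 1_191_785,
    "Springer": 540_889,
    "Frontiers": 65_927,
    "PLoS": 179_571,
}

# Articles harvested from repositories and full OA publishers via OAI-PMH.
repositories = 9_033_808

publisher_total = sum(publisher_connector.values())
dataset_total = repositories + publisher_total

print(publisher_total)  # 1978172
print(dataset_total)    # 11011980
```

Both sums agree with the "Total publisher connector" and "Total dataset" rows.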

8
OpenMinTeD consortium plenary Lausanne
Task 5.5
Highlights
1. A dataset of more than 11 million open access full texts, several times larger than any other legally downloadable set of Open Access (OA) papers, such as the PubMed OA subset and arXiv.org.
2. First solution for large-scale aggregation of hybrid-Gold OA papers from the non-standardised systems of key publishers.
3. First implementation and application of ResourceSync (Haslhofer et al., 2013) that scales to millions of items.
9
Task 5.5
Key technical decisions 1/2
• What is a corpus of scientific publications?
  • A set of identifiers (hashes calculated from the publications' content, with links to metadata) expressed in OMTD-SHARE
  • Corpora are guaranteed to be persistent
• How are corpora created in the registry?
  • Federated search over publications in CORE/OpenAIRE
  • Results deduplicated based on document hashes exposed by their APIs (extension of OMTD-SHARE)
  • Lazy evaluation on corpus creation
• Where are the resources stored?
  • In a distributed object storage system
• How are content resources accessible?
  • GET/PUT interface
  • Publication: key is the hash
  • Metadata: key is a generated filename <source>-<sourceID>-timestamp.xml
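
The keying scheme on this slide can be sketched as a small content-addressed store. This is an illustration under stated assumptions, not the project's implementation: the `ObjectStore` class is an in-memory stand-in for the distributed object storage, SHA-256 is assumed because the slides do not name the hash function, and the timestamp format in the metadata filename is also assumed.

```python
import hashlib
from datetime import datetime, timezone


class ObjectStore:
    """In-memory stand-in for a distributed object store (hypothetical)."""

    def __init__(self):
        self._objects = {}  # maps key (str) to stored bytes

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def __len__(self):
        return len(self._objects)


def publication_key(content):
    # Assumption: SHA-256. The slides only say the key is a hash of the content.
    return hashlib.sha256(content).hexdigest()


def metadata_key(source, source_id, when):
    # Follows the slide's pattern <source>-<sourceID>-timestamp.xml.
    # The compact UTC timestamp format is an assumption.
    return f"{source}-{source_id}-{when.strftime('%Y%m%dT%H%M%SZ')}.xml"


store = ObjectStore()
pdf_bytes = b"%PDF-1.4 example content"
key = publication_key(pdf_bytes)

store.put(key, pdf_bytes)
store.put(key, pdf_bytes)  # same content, same key: stored only once

meta_name = metadata_key("core", "12345",
                         datetime(2018, 1, 15, tzinfo=timezone.utc))
```

Because the key is derived from the content, putting the same file twice overwrites one entry instead of creating a second copy.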

10
Task 5.5
Key technical decisions 2/2
• How is reproducibility achieved?
  • Once a corpus is created, its data stay in the document storage forever
• How is it ensured that the same files are not stored many times?
  • Ensured by the hashing mechanism
• How do we ensure that a new corpus does not contain duplicate resources from CORE/OpenAIRE?
  • The CORE and OpenAIRE APIs both apply the same hashing function to content (extension of OMTD-SHARE)
  • Results are deduplicated in the registry
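
Because both APIs expose the same content hash, deduplication in the registry reduces to keeping one record per hash. The sketch below assumes each search result is a dictionary with a `hash` field; the record layout and the sample values are hypothetical and are not taken from the CORE or OpenAIRE APIs.

```python
def deduplicate(*result_lists):
    """Merge search results, keeping the first record seen for each hash."""
    seen = {}
    for results in result_lists:
        for record in results:
            # setdefault keeps the existing record if the hash is already known.
            seen.setdefault(record["hash"], record)
    return list(seen.values())


# Hypothetical federated search results from the two connectors.
core_results = [
    {"hash": "a1", "source": "CORE", "id": "core:1"},
    {"hash": "b2", "source": "CORE", "id": "core:2"},
]
openaire_results = [
    {"hash": "b2", "source": "OpenAIRE", "id": "oa:9"},
    {"hash": "c3", "source": "OpenAIRE", "id": "oa:10"},
]

corpus = deduplicate(core_results, openaire_results)
print([r["hash"] for r in corpus])  # ['a1', 'b2', 'c3']
```

The document with hash `b2` is returned by both sources but appears once in the corpus, attributed to whichever source was listed first.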

11
Task 5.5
Conclusions
• T5.5 fully completed all work set out in the DoW.
• All three key components in production
• First scalable implementation of ResourceSync
• World’s largest set of OA documents (e.g. more than arXiv and PubMed OA) assembled from publishers.
• Feedback of reviewers addressed and integrated
• Future work:
  • Continue adding more publishers, testing and maintaining the service.
  • A lot of interest in the connector
  • Sustainability of the connector beyond the project lifetime.

Editor's Notes

- Achieving interoperability across publishers at the level of files (the publisher connector intentionally does not parse nor understand the different metadata formats of publishers; these will only be interpreted by aggregators like CORE/OpenAIRE)
- Lack of an adopted common API approach for harvesting across publishers (e.g. like OAI-PMH across repositories)
- Different mechanisms for flagging OA content
- Consistent provision of full text links in metadata (including in CrossRef TDM)
- Lack of support for discovery of new content
- Technical (and also legal) issues around systematic full text aggregation from publishers
- Full text harvesting/crawling limits in place on publisher endpoints
- Lack of documentation on publisher systems

Reasons to adopt ResourceSync for this task:
- Very large dataset with an ongoing stream of content. OAI-PMH fails in these situations.
- Updates need to be properly addressed and synchronised quickly.
- Enable CORE/OpenAIRE to ingest content via ResourceSync, making it possible for CORE/OpenAIRE to also encourage repositories to start replacing their old OAI-PMH ingestion mechanisms with more efficient ResourceSync mechanisms.

To achieve the desired functionality, we need to:
- Develop a web server on top of the ResourceSync implementation developed at DANS.
- Adapt the logic for the generation of ChangeLists so that changes don't have to be detected but are fed directly from the Publisher Connector ingestion mechanisms.
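
To make the ChangeList idea concrete, here is a simplified sketch that serialises a list of known changes as a ResourceSync Change List, which is a Sitemap document extended with `rs:md` elements. It does not use the DANS implementation mentioned in the notes; the function name, the shape of the `changes` records, and the example URLs are assumptions made for illustration. The changes are passed in directly, mirroring the note that they come from the ingestion mechanism instead of being detected.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"

# Emit the Sitemap namespace as the default and ResourceSync terms as "rs:".
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("rs", RS_NS)


def build_change_list(changes, from_time):
    """Serialise change records as a ResourceSync Change List (sketch)."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    # Document-level metadata declares the capability and the covered period.
    ET.SubElement(urlset, f"{{{RS_NS}}}md",
                  {"capability": "changelist", "from": from_time})
    for change in changes:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = change["loc"]
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = change["lastmod"]
        # change is one of "created", "updated", or "deleted".
        ET.SubElement(url, f"{{{RS_NS}}}md", {"change": change["change"]})
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical changes handed over by an ingestion process.
changes = [
    {"loc": "https://connector.example.org/files/a1.pdf",
     "lastmod": "2018-01-02T10:00:00Z", "change": "created"},
    {"loc": "https://connector.example.org/files/b2.pdf",
     "lastmod": "2018-01-03T12:30:00Z", "change": "updated"},
]
change_list_xml = build_change_list(changes, "2018-01-01T00:00:00Z")
```

A consumer only needs to read the entries in this document to learn what changed since `from_time`, which avoids re-harvesting the full collection.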
