1. @openaire_eu OpenAIRE-Connect Review
23rd of April, 2018 - Brussels
The OpenAIRE Research Graph
Bringing scholarly communication back into the
hands of scientists
Paolo Manghi
Institute of Information Science and Technologies
Consiglio Nazionale delle Ricerche
2. Materializing the Open Science Graph
[Diagram labels]
Entities: Project, Community, Funder, Funding, Product, Publication, Research Data, Software, Other research products, Organization, Source
Processes: Harvesting, Mining, Deduplication, End-user feedback
Scientific product catalogue
Guidelines
Research Infrastructures Publishing
IT
OpenAIRE Advance 1st Review | Luxembourg | 10 Oct 2019
3. The OpenAIRE research graph
Providing an open metadata research graph of interlinked scientific products, with Open Access information, linked to funding information and research communities
Open
Complete
De-duplicated
Transparent
Participatory
Decentralized
Trusted
4. De-duplicated
More information about the de-duplication framework used by OpenAIRE can be found by searching Zenodo for:
• “De-duplicating the OpenAIRE Scholarly Communication Big Graph” (poster)
• “GDup: De-Duplication of Scholarly Communication Big Graphs”
Metadata records
corresponding to equivalent
objects are merged
Scientific products
Organizations
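The merging of equivalent metadata records can be sketched as follows. This is a minimal illustration, not the GDup framework: the matching key (DOI, falling back to a normalized title), the field names and the sample records are all assumptions made for the example.

```python
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Lowercase a title and keep only letters and digits for comparison."""
    return "".join(ch for ch in title.lower() if ch.isalnum())


def merge_group(records: list) -> dict:
    """Merge equivalent records: keep the first non-empty value per field
    and remember which sources contributed."""
    merged = {"merged_from": [r["source"] for r in records]}
    for record in records:
        for field, value in record.items():
            if field != "source" and value and field not in merged:
                merged[field] = value
    return merged


def deduplicate(records: list) -> list:
    """Group records by DOI (or by normalized title when the DOI is missing)
    and merge each group into one representative record."""
    groups = defaultdict(list)
    for record in records:
        doi = (record.get("doi") or "").lower()
        key = f"doi:{doi}" if doi else f"title:{normalize_title(record['title'])}"
        groups[key].append(record)
    return [merge_group(group) for group in groups.values()]


# Hypothetical records harvested from three sources.
records = [
    {"source": "repo-a", "doi": "10.1234/ABC", "title": "A Study", "year": None},
    {"source": "repo-b", "doi": "10.1234/abc", "title": "A study", "year": 2018},
    {"source": "repo-c", "doi": None, "title": "Another paper", "year": 2019},
]
deduped = deduplicate(records)
```

Exact-key grouping like this misses near-duplicates with typos or missing identifiers, which is where a similarity-based framework and manual curation come in.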
6. Participatory
• Rely on quality scholarly communication sources of different kinds
• Include solutions and content from any interested and known content provider in scholarly communication
Institutional repositories
Aggregators
Data archives
Software repositories
Research infrastructure sources
Funder grant databases
Authors & Orgs entity registries
Publishers & journals
7. Transparent
• Metadata in the graph includes provenance when harvested and reliability indicators when obtained from mining
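One way to picture provenance and reliability indicators is to attach them to every metadata value. The sketch below is illustrative only: the class names, field names and the trust threshold are assumptions, not the graph's actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Provenance:
    """Where a piece of metadata came from (field names are illustrative)."""
    method: str                    # "harvested" or "mined"
    source: str                    # data source or mining algorithm name
    trust: Optional[float] = None  # reliability indicator, only for mined values


@dataclass(frozen=True)
class Statement:
    """A metadata value together with its provenance."""
    field: str
    value: str
    provenance: Provenance


def keep_reliable(statements: List[Statement], min_trust: float) -> List[Statement]:
    """Keep harvested statements and mined ones at or above min_trust."""
    return [
        s for s in statements
        if s.provenance.method == "harvested"
        or (s.provenance.trust is not None and s.provenance.trust >= min_trust)
    ]


# Hypothetical statements about one publication.
statements = [
    Statement("title", "A Study", Provenance("harvested", "repo-a")),
    Statement("project", "grant-1", Provenance("mined", "fulltext-miner", 0.9)),
    Statement("dataset", "ds-7", Provenance("mined", "fulltext-miner", 0.4)),
]
reliable = keep_reliable(statements, min_trust=0.8)
```

Keeping the indicator next to the value lets each consumer choose its own threshold instead of relying on a single global cut-off.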
8. Decentralized
• Preservation and ownership beyond OpenAIRE
Exchanged with other graph initiatives
Broker Service: Redistributed via subscription and notification to contributing data sources (provide.openaire.eu)
• Openly accessible via APIs (develop.openaire.eu)
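The subscription-and-notification idea behind a broker can be sketched with a toy publish/subscribe class. This is not the OpenAIRE Broker Service API: the class, the topic string and the event fields are hypothetical and only show the pattern.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class Broker:
    """Toy publish/subscribe broker (not the OpenAIRE Broker Service API)."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[dict], None]) -> None:
        """Register a callback that receives every event sent on a topic."""
        self._subscribers[topic].append(callback)

    def notify(self, topic: str, event: dict) -> int:
        """Deliver an event to all subscribers of a topic; return how many got it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(event)
        return len(callbacks)


# A hypothetical data source subscribes to enrichment events about its records.
inbox: List[dict] = []
broker = Broker()
broker.subscribe("enrichment/missing-project-link", inbox.append)
delivered = broker.notify(
    "enrichment/missing-project-link",
    {"record": "repo-a:123", "project": "hypothetical-grant-42"},
)
ignored = broker.notify("enrichment/other", {"record": "repo-a:456"})
```

Here the data source receives only the event on the topic it subscribed to; events on other topics are not delivered to it.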
9. Trusted (November 2019)
• Authors in the loop to enrich their ORCID record
• Validation of end-user "claims"
11. Harvesting: Revised Classification of Research Products
Publications
• Article
• Preprint
• Report
• …
Datasets
• Dataset
• Collection
• Clinical Trials
• …
Software
• Research Software
• …
Other Research Products
• Service
• Workflow
• Interactive Resource
• …
Institutional/publication repositories
Journals/publishers
Data repositories
Other Products repositories
Software repositories
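The classification above can be expressed as a lookup from subtype to top-level class. The sketch includes only the subtypes named on the slide (the "…" entries are left out), and the fallback to "Other Research Products" for unknown subtypes is an assumption made for the example.

```python
# Top-level classes and the subtypes listed on the slide.
CLASSIFICATION = {
    "Publications": ["Article", "Preprint", "Report"],
    "Datasets": ["Dataset", "Collection", "Clinical Trials"],
    "Software": ["Research Software"],
    "Other Research Products": ["Service", "Workflow", "Interactive Resource"],
}

# Reverse index: lowercase subtype -> top-level class.
TYPE_TO_CLASS = {
    subtype.lower(): top_level
    for top_level, subtypes in CLASSIFICATION.items()
    for subtype in subtypes
}


def classify(subtype: str) -> str:
    """Return the top-level class of a subtype.

    Unknown subtypes fall back to "Other Research Products"
    (an assumption of this sketch, not a documented rule).
    """
    return TYPE_TO_CLASS.get(subtype.strip().lower(), "Other Research Products")


preprint_class = classify("Preprint")
trial_class = classify(" clinical trials ")
unknown_class = classify("Lab notebook")
```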
Workshop Técnico OpenAIRE / LA Referencia | 29-30 October, 2019 | Costa Rica
12. Open Science publishing
Bridging RIs and Scholarly Communication
Transparency and reproducibility
[Diagram labels]
e-Infrastructures and Research Infrastructures: Thematic Service, Experiment (Input Dataset, Input Method, Parameters, Output Dataset), Publishing the experiment, Experiment product
Repositories: Experiment repo, Data repo, Method repo, Publications
Scholarly Communication infrastructure: Research data, Software, Workflows, Publications
IT
Harvesting
13. Open Science publishing workflows
• EPOS Research Infrastructure
Reproducibility
Transparency
Seamless publishing
14. Pre-processed sources
Article-dataset links: 480Mi links
DOIBoost: CrossRef enriched, 85Mi publication records
Academic Graph
Published every 6 months (new versions to be published next week)
16. Production: Open Access CAPs
BETA: Open Science CAPs
[Charts: Old CAP (PROD) vs New CAP (BETA), one panel per product type]
literature: PROD 30Mi, BETA 110Mi
research data: PROD 1Mi, BETA 10Mi
software: PROD 100K, BETA 180K
other: PROD 3Mi, BETA 7.5Mi
Harvested content
• Data sources: 10K+
• Records: ~480Mi
• Publication full-texts: ~12Mi (Springer N. coming)
• Links (also text-mined): ~960Mi
17. Liaisons
Microsoft Research, Academic Graph (being drafted)
Unpaywall (ongoing)
ORCID membership (November 2019)
RDA IG Open Science Graphs for FAIR Data
FREYA, ResearchGraph, OpenCitations, Open Research Knowledge Graph
IG Session at RDA Helsinki 2019 (15th of October 2019)
18. BETA Graph Open Consultation
• October-November 2019: OpenAIRE Research Graph open for consultation
Collecting feedback via Trello (operational end of September)
• December 2019: OpenAIRE Research Graph in production
http://beta.explore.openaire.eu
23. Task 9.1. System administration - infrastructure: before Jan 2018
Public System: 20 srv, 122 CPU, 320 GB, 8 TB
Mining System: 21 srv, 406 CPU, 2 TB, 385 TB
Data provision System: 23 srv, 154 CPU, 430 GB, 23 TB
Testing System: 5 srv, 30 CPU, 100 GB, 3 TB

Public System: 44 srv, 274 CPU, 905 GB, 20 TB
Mining System: 22 srv, 414 CPU, 2.2 TB, 388 TB
Data provision System: 23 srv, 154 CPU, 430 GB, 24 TB
Testing System: 14 srv, 86 CPU, 302 GB, 9 TB
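The two sets of figures on this slide can be totalled and compared with a few lines of arithmetic. The sketch uses only the server and CPU counts; the variable names are illustrative, and since the label of the second set is not preserved in the slide text it is simply called the second set here.

```python
# (servers, CPUs) per system, copied from the slide.
before_jan_2018 = {
    "Public": (20, 122),
    "Mining": (21, 406),
    "Data provision": (23, 154),
    "Testing": (5, 30),
}
second_set = {
    "Public": (44, 274),
    "Mining": (22, 414),
    "Data provision": (23, 154),
    "Testing": (14, 86),
}


def totals(systems: dict) -> tuple:
    """Sum servers and CPUs over all systems."""
    return tuple(sum(column) for column in zip(*systems.values()))


def server_growth(first: dict, second: dict) -> dict:
    """Difference in server count per system between the two sets."""
    return {name: second[name][0] - first[name][0] for name in first}


total_before = totals(before_jan_2018)
total_second = totals(second_set)
growth = server_growth(before_jan_2018, second_set)
```

Under this reading, most of the added servers went to the Public and Testing systems, while Data provision kept the same server count.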
Editor's Notes
How does OpenAIRE materialize the graph?
Collecting records (dedup)
Collecting full-texts for OA publications
Mining full-texts of publications to find links to data, software, other products, projects, research communities and infrastructures, and to enhance metadata with affiliation information and subjects/keywords: article-data and data-data links are around 120Mi, article-article similarity links are around 300Mi
GOAL: High quality open graph for
Open (because it must be), Complete (all «trusted»/known sources), De-duplicated (must be disambiguated for statistics), Transparent (provenance), Participatory (not a closed network), Decentralized (ownership and redistribution), Trusted (manual curation)
Supported entity types
People to come with the ORCID collaboration
The algorithm can be improved, but some cases can be handled only manually
Any interested content provider can join the network to provide content. Not a closed network. Interoperability guidelines help in the process.
Mining trust: probability of the mining information to be correct
OpenAIRE DOES NOT own the graph
In production today we acquire content according to the Open Access-driven CAP: 30Mi publications with links to other objects (e.g. 1Mi datasets, etc.)
In BETA we acquire content according to the Open Science-driven CAP: this means we collect EVERYTHING (that is in a trusted source), meaning also non-OA content
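The difference between the two content acquisition policies (CAPs) can be sketched as a filter over incoming records. This is a simplification under stated assumptions: the record fields `trusted_source` and `open_access` are hypothetical, and the real Open Access-driven policy may have further conditions not described in these notes.

```python
from enum import Enum


class Cap(Enum):
    OPEN_ACCESS = "open_access"    # policy used in production, per the notes
    OPEN_SCIENCE = "open_science"  # policy used in BETA, per the notes


def accepts(record: dict, policy: Cap) -> bool:
    """Decide whether a record is acquired under the given policy."""
    if not record["trusted_source"]:
        return False              # both policies only take trusted sources
    if policy is Cap.OPEN_SCIENCE:
        return True               # everything from a trusted source, OA or not
    return record["open_access"]  # simplified Open Access-driven rule


# Hypothetical records.
records = [
    {"id": "r1", "trusted_source": True, "open_access": True},
    {"id": "r2", "trusted_source": True, "open_access": False},
    {"id": "r3", "trusted_source": False, "open_access": True},
]
production = [r["id"] for r in records if accepts(r, Cap.OPEN_ACCESS)]
beta = [r["id"] for r in records if accepts(r, Cap.OPEN_SCIENCE)]
```

In this toy run the non-OA record from a trusted source is the one that only the Open Science-driven policy acquires.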