Scholarly citations from one publication to another, expressed as reference lists within academic articles, are core elements of scholarly communication. Unfortunately, they usually can be accessed en masse only by paying significant subscription fees to commercial organizations, while those few services that do make them available for free impose strict limitations on their reuse. In this paper we provide an overview of the OpenCitations Project (http://opencitations.net) undertaken to remedy this situation, and of its main product, the OpenCitations Corpus, which is an open repository of accurate bibliographic citation data harvested from the scholarly literature, made available in RDF under a Creative Commons public domain dedication.
Paper at: https://w3id.org/oc/paper/occ-lisc2016.html
Freedom for bibliographic references: OpenCitations arise
1. Freedom for bibliographic references: OpenCitations arise
Silvio Peroni, David Shotton, Fabio Vitali
4th International Workshop on
Linked Data for Information Extraction (LD4IE 2016)
Kobe, Japan, October 18, 2016
https://w3id.org/oc/paper/occ-lisc2016.html
2. The Venice analogy
• Island = scholarly publication
• Bridge = citation
• Current situation:
– local travel to the next island is permitted
– unrestricted travel over the entire network of bridges requires an expensive season ticket
– general populace is excluded
https://w3id.org/oc/paper/the-venice-analogy.html
3. Opening the bridges
• What – Citation data are one of the main tools used by
researchers to gain knowledge about particular topics, and
they also serve institutional goals, for example in research
assessment
• Problem – The most authoritative databases of citation data,
Scopus and Web of Science, can only be accessed by paying
significant annual access fees
– The University of Bologna pays about 6,000,000 euros per year for
access to digital bibliographic resources
• Solution – To create a citation database that freely and legally
makes available citation data in an open repository to assist
scholars with their academic studies and serve knowledge to
the wider public
4. OpenCitations
• The OpenCitations Project aims at creating an open repository of
scholarly citation data – the OpenCitations Corpus (OCC) – made
available under a Creative Commons public domain dedication to
provide in RDF accurate citation information (bibliographic
references) harvested from the scholarly literature
– All scripts are released under the open source ISC Licence and are available on
GitHub at http://github.com/essepuntato/opencitations
• Currently processing papers available in the PubMedCentral Open
Access subset (which contains papers related to the medical,
biological, and life science domains) by means of the Europe
PubMedCentral API
• As of October 17, 2016 the OCC contains
– 1,311,196 citing/cited bibliographic resources
– 1,584,945 citation links
http://opencitations.net
5. OpenCitations Ontology
• The OpenCitations Ontology (OCO) groups existing complementary ontological
entities from several other ontologies for the purpose of providing descriptive
metadata for the OCC
• SPAR Ontologies reused:
– FRBR-aligned Bibliographic Ontology (FaBiO, http://purl.org/spar/fabio)
– Publishing Roles Ontology (PRO, http://purl.org/spar/pro)
– Bibliographic Reference Ontology (BiRO, http://purl.org/spar/biro)
– Citation Counting and Context Characterization Ontology (C4O, http://purl.org/spar/c4o)
– DataCite Ontology (http://purl.org/spar/datacite)
6. OpenCitations Corpus
• Six distinct kinds of bibliographic entities
– bibliographic resources (citing/cited articles, journals, books, proceedings, etc.)
– resource embodiments (format information about bibliographic resources)
– bibliographic entries (literal textual entries occurring in the reference lists)
– responsible agents (agents having certain roles with respect to the bibliographic resources)
– agent roles (author, editor, publisher)
– identifiers (DOI, ORCID, PubMedID, URL, etc.)
• Provenance for each entity is handled by means of PROV-O, as described in the
paper for Drift-a-LOD 2016 (a workshop held in Bologna next month during EKAW 2016),
available at
https://w3id.org/oc/paper/occ-driftalod2016.html
• Access the OCC via
– HTTP (content negotiation, formats: JSON-LD, RDF/XML, TriG, HTML),
e.g. https://w3id.org/oc/corpus/br/1
– SPARQL endpoint, available at https://w3id.org/oc/sparql
– dumps, downloadable at https://opencitations.net/download
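As a minimal sketch of the content negotiation route listed above, the following Python snippet builds (but does not send) an HTTP request for the example resource with an explicit Accept header. The helper name build_request and the format-to-media-type table are illustrative assumptions, not part of the OCC tooling; the media types are the standard registered ones for each format, and it is assumed here that the server maps them to the formats named on the slide.

```python
from urllib.request import Request

# Standard media types for the formats named on the slide (assumed mapping).
MEDIA_TYPES = {
    "JSON-LD": "application/ld+json",
    "RDF/XML": "application/rdf+xml",
    "TriG": "application/trig",
    "HTML": "text/html",
}


def build_request(resource_url: str, fmt: str) -> Request:
    """Return an unsent GET request asking for `fmt` via the Accept header."""
    return Request(resource_url, headers={"Accept": MEDIA_TYPES[fmt]})


# Hypothetical usage: ask for the JSON-LD rendering of the example resource.
req = build_request("https://w3id.org/oc/corpus/br/1", "JSON-LD")
print(req.get_method(), req.full_url, req.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the actual retrieval; that step is left out so the sketch runs without network access.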
7. Ingestion workflow
BEE (EuropeanPubMedCentralProcessor)
• Step 1: Parse the XML source of PubMed Central Open Access articles.
• Step 2: Produce JSON with DOI and bibliographic entries, for example:
{
  "doi": "10.1590/1414-431x20154655",
  "localid": "MED-26577845",
  "curator": "BEE EuropeanPubMedCentralProcessor",
  "source": "http://www.ebi.ac.uk/europepmc/webservices/rest/PMC4678653/fullTextXML",
  "source_provider": "Europe PubMed Central",
  "pmid": "26577845",
  "pmcid": "PMC4678653",
  "references": [
    {
      "bibentry": "Wenger, NK. Coronary heart disease: an older woman's major health risk, BMJ, 1997, 315, 1085, 1090, DOI: 10.1136/bmj.315.7115.1085, PMID: 9366743",
      "pmid": "9366743",
      "doi": "10.1136/bmj.315.7115.1085",
      "pmcid": "PMC2127693",
      "process_entry": "True"
    } …
  ]
}
SPACIN
• Step 3 (ResourceFinder, querying the store): For each citing/cited resource, if an ID
(DOI, PMID, PMCID) is specified, check whether the resource already exists. If it does,
go to step 5.
• Step 4 (CrossRefProcessor, ORCIDProcessor): If the resource doesn’t exist, extract
possible IDs from the entry and query CrossRef and ORCID.
• Step 5 (GraphEntity): New metadata resources are created. If CrossRef/ORCID returned
something, all the related metadata will be used; otherwise only basic metadata (IDs
and entries) will be added.
• Step 6 (GraphSet, ProvSet, DatasetHandler, Storer): Load all the statements onto the
triplestore and store them in the file system for easy recovery. The result is the OCC.
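The existence check and resource creation of the workflow can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPACIN code: InMemoryFinder is a hypothetical stand-in for the triplestore lookup, the names resolve and ingest are invented for this example, the "br/N" naming simply mimics the corpus URL pattern, and the CrossRef/ORCID lookup is omitted, so only basic metadata (the IDs) is recorded.

```python
import json


class InMemoryFinder:
    """Hypothetical stand-in for the triplestore lookup."""

    def __init__(self):
        self._by_id = {}  # (scheme, value) -> resource name
        self._count = 0

    def find(self, ids):
        """Return the known resource for any of the given IDs, or None."""
        for key in ids:
            if key in self._by_id:
                return self._by_id[key]
        return None

    def create(self, ids):
        """Register a new resource under all of its IDs and return its name."""
        self._count += 1
        name = f"br/{self._count}"
        for key in ids:
            self._by_id[key] = name
        return name


def extract_ids(record):
    """Collect the (scheme, value) pairs present in a BEE-style record."""
    return [(s, record[s]) for s in ("doi", "pmid", "pmcid") if record.get(s)]


def resolve(record, finder):
    """Reuse an existing resource if any ID matches, else create a new one."""
    ids = extract_ids(record)
    found = finder.find(ids)
    return found if found is not None else finder.create(ids)


def ingest(bee_json, finder):
    """Return (citing, cited) links for one BEE-style JSON document."""
    doc = json.loads(bee_json)
    citing = resolve(doc, finder)
    return [(citing, resolve(ref, finder)) for ref in doc.get("references", [])]


# IDs taken from the JSON example on the slide; other fields are left out.
sample = json.dumps({
    "doi": "10.1590/1414-431x20154655",
    "pmid": "26577845",
    "pmcid": "PMC4678653",
    "references": [
        {"doi": "10.1136/bmj.315.7115.1085", "pmid": "9366743", "pmcid": "PMC2127693"}
    ],
})
finder = InMemoryFinder()
links = ingest(sample, finder)
print(links)
```

Ingesting the same document a second time yields the same resource names, which is the purpose of checking for existing IDs before creating anything new.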
8. Test
• Hardware: MacBook Pro, with 2 GHz Intel Core i7 processor, 8 GB
DDR3 1600 MHz, OS X 10.11.3
• BEE: running for 30 minutes (querying Europe PubMedCentral API),
produced 185 JSON files (~6 new JSON files per minute)
• SPACIN
– 45 minutes to process all BEE JSON files related to the 67 papers in the
ISWC 2015 Proceedings (sources kindly made available by Springer-Nature)
– 210 minutes to process BEE JSON files related to 67 papers from Europe
PubMed Central (OA subset)
All these data are available on Figshare – their URLs are included in the article.
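The figures above can be turned into per-unit rates with a few lines of Python. The variable names are illustrative; the inputs are the numbers reported on this slide.

```python
# Inputs reported on the slide.
bee_files, bee_minutes = 185, 30
papers = 67
iswc_minutes, epmc_minutes = 45, 210

bee_rate = bee_files / bee_minutes           # JSON files per minute
iswc_per_paper = iswc_minutes * 60 / papers  # seconds per ISWC 2015 paper
epmc_per_paper = epmc_minutes * 60 / papers  # seconds per Europe PMC paper
slowdown = epmc_minutes / iswc_minutes       # Europe PMC run vs ISWC run

print(f"BEE: {bee_rate:.2f} files/min")
print(f"SPACIN, ISWC 2015: {iswc_per_paper:.0f} s/paper")
print(f"SPACIN, Europe PMC: {epmc_per_paper:.0f} s/paper")
print(f"Ratio: {slowdown:.2f}x")
```

The slides do not say why the Europe PubMed Central batch took longer, so the ratio is only a descriptive figure.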
9. ISWC2015: most cited papers
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX fabio: <http://purl.org/spar/fabio/>
PREFIX cito: <http://purl.org/spar/cito/>
SELECT ?cited ?title ?tot {
  { SELECT ?cited (COUNT(?citing) AS ?tot) {
      ?cited a fabio:Expression ; ^cito:cites ?citing }
    GROUP BY ?cited }
  OPTIONAL { ?cited dcterms:title ?title }
} ORDER BY DESC(?tot) LIMIT 15
Note: one of the returned resources has no title.
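To make the logic of the query concrete, here is a small Python sketch of the same aggregation over in-memory data: count incoming citations per cited resource, attach the title when one is known, and keep the top results. The function name most_cited and the toy data are invented for illustration, and the fabio:Expression type check of the query is not modelled.

```python
from collections import Counter


def most_cited(citations, titles, limit=15):
    """Rank cited resources by number of incoming citations.

    `citations` is an iterable of (citing, cited) pairs; `titles` maps a
    resource to its title. A missing title is returned as None, like the
    unbound ?title left by the OPTIONAL clause.
    """
    counts = Counter(cited for _citing, cited in citations)
    return [(cited, titles.get(cited), tot) for cited, tot in counts.most_common(limit)]


# Toy data, not taken from the corpus.
toy_citations = [("a", "x"), ("b", "x"), ("c", "x"), ("a", "y")]
toy_titles = {"y": "Some title"}
ranking = most_cited(toy_citations, toy_titles)
print(ranking)
```

A resource with citations but no recorded title still appears in the ranking, which is how an untitled entry can show up among the results.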
10. No Crossref metadata
PREFIX biro: <http://purl.org/spar/biro/>
PREFIX c4o: <http://purl.org/spar/c4o/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX frbr: <http://purl.org/vocab/frbr/core#>
SELECT ?citing ?entry {
<http://localhost:8000/corpus/br/1302> ^biro:references ?ref .
?ref c4o:hasContent ?entry ; ^frbr:part ?citing
}
How the “no title” paper has been referenced in the 4 papers citing it
SPACIN used the URL in the textual entries
(i.e. “http://www.w3.org/DesignIssues/LinkedData.html”)
to associate them with the same bibliographic resource:
<http://localhost:8000/corpus/br/1302>
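The idea of using a URL found in the textual entries as a shared key can be sketched as follows. This is an assumption-laden illustration, not SPACIN's actual extraction logic, which the slides do not show: the regular expression, the function names, and the two example entries are all made up for this sketch.

```python
import re
from collections import defaultdict

# Simple URL pattern (an assumption; real entries may need something stricter).
URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+")


def extract_url(entry):
    """Return the first URL in a reference entry, without trailing punctuation."""
    match = URL_PATTERN.search(entry)
    if match is None:
        return None
    return match.group(0).rstrip(".,;)")


def group_by_url(entries):
    """Group entries that mention the same URL; entries without a URL are skipped."""
    groups = defaultdict(list)
    for entry in entries:
        url = extract_url(entry)
        if url is not None:
            groups[url].append(entry)
    return dict(groups)


# Made-up entries that differ in wording but share the URL from the slide.
entries = [
    "Linked Data design issues, http://www.w3.org/DesignIssues/LinkedData.html.",
    "T. Berners-Lee. Linked Data. Available at http://www.w3.org/DesignIssues/LinkedData.html",
    "An entry without any URL.",
]
groups = group_by_url(entries)
print(groups)
```

Both URL-bearing entries end up under the same key, so they can be attached to a single bibliographic resource even though neither could be matched through Crossref metadata.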
11. Conclusions
• We have introduced the OpenCitations Project, which has created an open
repository of accurate bibliographic references harvested from the scholarly
literature, i.e. the OpenCitations Corpus (OCC)
• The number of citation links is growing day by day (about 25,000 new citation
links per day) as the continuous workflow adds new data dynamically from
Europe PubMedCentral (and other authoritative sources, i.e. Crossref and
ORCID)
• First adopter: Wikidata (via WikiCite)
– The Wikidata community has created a property for associating the OCC bibliographic
resource identifier to the metadata about scholarly papers in Wikidata
– Several links from Wikidata to the OCC have already been added
• Future plans: developing tools for linking the resources within the OCC with those
included in other datasets, e.g. Wikidata, Scholarly Data, Springer LOD
• Don’t hesitate to poke me during the poster and demo session on Wednesday
(panel P30) for additional details about OpenCitations – and don’t forget to vote
for it, of course :-)
12. Thanks for your attention