Analysing the performance of open access papers discovery tools

Petr Knoth
Matteo Cancellieri
June 13, 2019 – OR 2019, Hamburg, Germany
CORE
Big Scientific Data and Text Analytics Group
Knowledge Media Institute, The Open University

Why Open Access (OA) Discovery?
• Automating the process of finding the full text of a research paper
• Identifying free copies of paywalled papers
• Reducing the access process to just one click
• Analysis and monitoring of OA, subscription negotiations
• Discovery tools: browser extensions, system integrations, public APIs

Search vs Discovery
• Search: Given a query, find relevant papers
• Discovery: Given one or more document identifiers, return the full text

Aims of this work
In scope:
• Quantitatively compare and evaluate OA discovery tools using
widely established information retrieval metrics
• Identify gaps for improvement of OA discovery tools
• Design a tool that maximises performance (CORE Discovery)
Not in scope:
• Discovery beyond freely available content
• Illegal tools

What are OA discovery systems?
Task definition: Given a document identifier
(DOI), give me the URL of a freely accessible
version of the document.

Most successful OA discovery methods
1. Using Crossref as a primary data source and systematically
crawling full text based on Crossref links or other information.
2. Calling a wide range of external APIs in real-time.
We implemented method 1 as a baseline (plus a call to CORE in
advance) to understand to what extent the available
methods are better.

How do OA discovery systems work?
Tools compared: Unpaywall, OA Button, K[.*]io, Baseline
Characteristics compared across the tools:
• Aims to find freely available copies of articles
• Helps subscribed users access non-OA content
• Enriches data to obtain more OA links than already provided by underlying infrastructures
• Builds a database of DOI -> URL mappings
• Calls external infrastructure services while serving users’ requests
Disclaimer: It is not possible for us to specify the name of the tool labelled as K[.*]io. Any potential similarity with an
existing tool on the market is purely coincidental. We make no claims and take no responsibility for any interpretations
that might arise.

Reliance on other infrastructures

Evaluation methodology
• Test all tools on the same data sample (DOIs) and capture the
result
• Query all tools as if the requests were issued by a user
• Baseline method:
• Collecting links from Crossref and crawling them to find full texts.
• Retrieving CORE data via the CORE API (as a batch prior to execution)
• Evaluation metrics:
• Hit rate - proportion of DOIs for which a URL is returned
• Precision - proportion of true positives, i.e. correctly identified freely
available article copy URLs, in the set of all returned URLs.
• Analysis of the returned results
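The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the study: the function names, the `responses` mapping (DOI to returned URL, or None) and the `judgements` mapping (DOI to whether the returned URL was judged correct) are assumptions made for the example, and the DOIs and URLs are made up.

```python
from typing import Dict, Optional


def hit_rate(responses: Dict[str, Optional[str]]) -> float:
    """Proportion of queried DOIs for which the tool returned any URL."""
    if not responses:
        return 0.0
    returned = sum(1 for url in responses.values() if url)
    return returned / len(responses)


def precision(responses: Dict[str, Optional[str]],
              judgements: Dict[str, bool]) -> float:
    """Proportion of returned URLs that were judged to be a correct free copy."""
    returned = [doi for doi, url in responses.items() if url]
    if not returned:
        return 0.0
    correct = sum(1 for doi in returned if judgements.get(doi, False))
    return correct / len(returned)


# Hypothetical toy data: four DOIs queried, three URLs returned, two correct.
responses = {
    "10.1000/a": "https://example.org/a.pdf",
    "10.1000/b": None,
    "10.1000/c": "https://example.org/c.pdf",
    "10.1000/d": "https://example.org/d",
}
judgements = {"10.1000/a": True, "10.1000/c": True, "10.1000/d": False}

print(f"hit rate:  {hit_rate(responses):.2%}")
print(f"precision: {precision(responses, judgements):.2%}")
```

Note that the denominators differ: hit rate is computed over all queried DOIs, precision only over the DOIs for which a URL came back.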

Data sample
• 100k DOIs randomly sampled from Crossref
• At a 99% confidence level, this sample size gives a confidence interval of 0.41%, i.e. below 1%.
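One common way to arrive at a figure of this size is the normal-approximation margin of error for a proportion. The sketch below assumes the worst case p = 0.5 and ignores any finite population correction. The slides do not say how the interval was computed, so the function and its parameters are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n: int, confidence: float = 0.99, p: float = 0.5) -> float:
    """Half-width of the confidence interval for a proportion (normal approximation)."""
    # Two-sided critical value, e.g. about 2.576 for a 99% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


moe = margin_of_error(100_000)
print(f"{moe:.2%}")  # 0.41%
```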

Hit rate with respect to paper publication year

Precision
• Responding with a URL to a given DOI does not guarantee that
the provided URL leads to a freely available version of the
correct paper.
• We crawl all URLs returned by each tool and test whether:
• the resource contains the string of the article’s title as recorded in Crossref,
• the text of the resource is the full version of the content (difficult to automate).
• Limitation: the automated check overestimates precision (manual check needed).
• No major differences on the automated check.
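The automated title test can be sketched as a substring check on normalised text. The normalisation steps here (Unicode folding, lower-casing, stripping punctuation, collapsing whitespace) are assumptions for illustration, since the slides only say that the resource must contain the title string as recorded in Crossref. The function names are hypothetical.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Fold case, replace punctuation with spaces and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def contains_title(resource_text: str, crossref_title: str) -> bool:
    """True if the normalised title appears in the normalised resource text."""
    title = normalise(crossref_title)
    return bool(title) and title in normalise(resource_text)


# Text as it might come out of a crawled PDF or HTML page (made up).
page = "Analysing the Performance of\nOpen Access Papers Discovery Tools.\nPetr Knoth"
print(contains_title(page, "Analysing the performance of open access papers discovery tools"))
```

A check like this passes for a landing page that merely shows the title, which is one reason the automated figure overestimates precision.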

Are some tools better for some disciplines?
No significant differences across disciplines

Pairwise overlap of the returned URLs
Overlap lower than expected
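A pairwise overlap of this kind could be computed as follows. The slides do not name the overlap measure, so the use of the Jaccard index over each tool's set of returned URLs is an assumption, and the tool names and URLs are placeholders.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(urls_by_tool: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard overlap for every unordered pair of tools."""
    return {
        (t1, t2): jaccard(urls_by_tool[t1], urls_by_tool[t2])
        for t1, t2 in combinations(sorted(urls_by_tool), 2)
    }


# Hypothetical returned URLs per tool.
urls_by_tool = {
    "tool_a": {"u1", "u2", "u3"},
    "tool_b": {"u2", "u3", "u4"},
    "tool_c": {"u5"},
}
overlap = pairwise_overlap(urls_by_tool)
for pair, value in overlap.items():
    print(pair, f"{value:.2f}")
```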

What hit rate can be achieved if tools are combined?
We can improve hit rate by combining the outputs from multiple discovery tools.
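The effect of combining tools can be illustrated by taking the union of the DOIs each tool resolves. The data below is made up, and `combined_hit_rate` is a hypothetical helper rather than part of any of the tools discussed.

```python
from typing import Dict, Iterable, Set


def combined_hit_rate(sample_dois: Iterable[str],
                      hits_by_tool: Dict[str, Set[str]]) -> float:
    """Proportion of sampled DOIs for which at least one tool returned a URL."""
    sample = set(sample_dois)
    if not sample:
        return 0.0
    covered = set().union(*hits_by_tool.values()) & sample
    return len(covered) / len(sample)


# Hypothetical sample of ten DOIs and the DOIs each tool found a URL for.
sample_dois = [f"10.1000/d{i}" for i in range(10)]
hits_by_tool = {
    "tool_a": {"10.1000/d0", "10.1000/d1", "10.1000/d2"},
    "tool_b": {"10.1000/d2", "10.1000/d3"},
}

for tool, hits in hits_by_tool.items():
    print(tool, combined_hit_rate(sample_dois, {tool: hits}))
print("combined", combined_hit_rate(sample_dois, hits_by_tool))
```

The gain from combining depends on how little the tools overlap: the combined hit rate equals the best single tool only when the other tools find nothing new.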

Introducing CORE Discovery
• High coverage of freely available content.
• Free service for researchers by researchers. No company controlling the pipes.
• Best grip on open repository content.
• Repository integration.
• Discovering documents without a DOI.
https://core.ac.uk/services/discovery/

How CORE Discovery works
• Run a process on a big data cluster that merges data from MAG (Microsoft Academic Graph), Crossref and Unpaywall (2018 dump) with CORE to find free links in advance.
• Crawl the provided links to find full texts.
• If no full text is found, call EPMC (Europe PMC).
• Originally started with two further sources, neither of which is used now:
• OA Button: increased hit rate but significantly decreased precision. Based on a manual check, ~32.59% of the links discovered by OA Button but not by CORE Discovery or Unpaywall were wrong.
• K[.*]io: the ability to call its API was removed early in 2019. It is also not used in CORE Discovery because of doubts regarding the delivery of many ResearchGate URL links.
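The lookup order described above (precomputed links first, an external call only on a miss) can be sketched as follows. This is an illustration of the pattern, not CORE Discovery's implementation: `discover`, the `precomputed` mapping and the `fallback` callable are hypothetical names, and the fallback is a stub standing in for a live call to an external service such as EPMC.

```python
from typing import Callable, Mapping, Optional


def discover(doi: str,
             precomputed: Mapping[str, str],
             fallback: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return a free full-text URL for a DOI, or None if nothing is found."""
    # DOIs are case-insensitive, so use a lower-cased key for the lookup.
    key = doi.strip().lower()
    url = precomputed.get(key)
    if url is not None:
        return url
    # Nothing found in advance: ask the external service at request time.
    return fallback(key)


# Hypothetical DOI -> URL table built in advance by the batch process.
precomputed = {"10.1000/a": "https://example.org/a.pdf"}


def stub_fallback(doi: str) -> Optional[str]:
    """Stand-in for a live lookup; knows exactly one extra DOI."""
    return "https://example.org/b.pdf" if doi == "10.1000/b" else None


print(discover("10.1000/A", precomputed, stub_fallback))
print(discover("10.1000/b", precomputed, stub_fallback))
print(discover("10.1000/c", precomputed, stub_fallback))
```

Resolving most requests from the precomputed table keeps response times low and limits dependence on external services at request time.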

Hit rate of CORE Discovery
• 10k random sample from Crossref.
             CORE Discovery   Unpaywall
Not found    7374             7474
Discovered   2626             2526
Hit rate     26.26%           25.26%

Performance of CORE Discovery
• Manually checked 200 responses where CORE Discovery and Unpaywall both returned a URL.
• Precision:
• CORE Discovery: 95.94%
• Unpaywall: 93.40%
Response type                CORE Discovery   Unpaywall
Display page with PDF link   9.64%            5.08%
HTML                         7.61%            7.61%
HTML + PDF                   3.55%            1.52%
PDF                          70.56%           78.17%
PDF in another language      1.02%            1.02%
TOC link                     3.55%            0.00%
Dead link                    0.51%            0.51%
HTML (abstract only)         0.51%            0.51%
DOI not detected             1.02%            3.55%
Wrong                        1.02%            1.02%
Wrong PDF                    1.02%            1.02%
Correct                      95.94%           93.40%
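The "Correct" row appears to be the sum of the response types that lead to the right paper. The slides do not state the grouping, so the set of rows summed below is an assumption inferred from the figures: the first six rows add up to the reported precision within rounding.

```python
# Percentages copied from the table; the choice of rows counted as
# "correct" is an assumption, not something the slides spell out.
core_correct = {
    "Display page with PDF link": 9.64,
    "HTML": 7.61,
    "HTML + PDF": 3.55,
    "PDF": 70.56,
    "PDF in another language": 1.02,
    "TOC link": 3.55,
}
unpaywall_correct = {
    "Display page with PDF link": 5.08,
    "HTML": 7.61,
    "HTML + PDF": 1.52,
    "PDF": 78.17,
    "PDF in another language": 1.02,
    "TOC link": 0.00,
}

core_total = sum(core_correct.values())
unpaywall_total = sum(unpaywall_correct.values())
print(f"CORE Discovery: {core_total:.2f}% (reported 95.94%)")
print(f"Unpaywall:      {unpaywall_total:.2f}% (reported 93.40%)")
```

The small gap for CORE Discovery comes from summing percentages that were already rounded to two decimals.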

CORE Discovery Repository integration
• The majority of articles in repositories are metadata only.
• CORE Discovery repository plugin:
• turns dead ends of user journeys into journeys that fulfil users’ information needs,
• makes repository content more discoverable.

Conclusions
• First study to quantitatively analyse the performance of OA discovery systems.
• We identified:
• Significant differences in the way OA discovery systems operate.
• Strategies that are successful.
• Potential for further improvement.
• We developed CORE Discovery, which offers one-click access to free copies of research papers whenever you hit a paywall.
• Install the CORE Discovery browser extension and/or our repository plugin.

Acknowledgements
Feedback: CORE Ambassadors, KMI staff, UK Repository
Managers
Lucas Anastasiou
Viktor Yakubiv Harriett Cornish Sergei Misak Nancy Pontika Svetlana Rumyanceva
Samuel Pearce Balviar Notay Chris Biggs Alan Stiles

Thank you!
https://core.ac.uk/services/discovery
