Open Access discovery tools aim to locate freely available copies of research papers that might be behind a paywall on a publisher’s website. Our study provides a large-scale quantitative performance comparison of several OA discovery tools on a randomly selected sample of 100k DOIs from Crossref. We use the knowledge acquired from this analysis to build a new discovery tool, CORE Discovery.
Analysing the performance of open access papers discovery tools
1. Analysing the performance of open access papers discovery tools
Petr Knoth
Matteo Cancellieri
June 13, 2019 – OR 2019, Hamburg, Germany
CORE
Big Scientific Data and Text Analytics Group
Knowledge Media Institute, The Open University
2. Why Open Access (OA) Discovery?
• Automating the process of finding a full text of a research paper
• Identifying free copies of paywalled papers
• Reducing the access process to just one click
• Analysis and monitoring of OA, subscription negotiation
• Discovery tools: browser extensions, system integrations, public APIs
3. Search vs Discovery
• Search: Given a query, find relevant papers
• Discovery: Given a document identifier(s), give me the full text
4. Aims of this work
In scope:
• Quantitatively compare and evaluate OA discovery tools using widely established information retrieval metrics
• Identify gaps for improvement of OA discovery tools
• Design a tool that maximises performance (CORE Discovery)
Not in scope:
• Discovery beyond freely available content
• Illegal tools
5. What are OA discovery systems?
Task definition: Given a document identifier (DOI), give me the URL of a freely accessible version of the document.
6. Most successful OA discovery methods
1. Using Crossref as a primary data source and systematically crawling full text based on Crossref links or other information.
2. Calling a wide range of external APIs in real time.
We implemented method 1 as a baseline (plus a call to CORE in advance) to understand to what extent the available methods are better.
7. How do OA discovery systems work?
Tools compared: Unpaywall, OA Button, K[.*]io, Baseline
Characteristics compared:
• Aims to find freely available copies of articles
• Helps subscribed users access non-OA content
• Enriches data to obtain more OA links than already provided by underlying infrastructures
• Builds a database of DOI -> URL mappings
• Calls external infrastructure services while serving users’ requests
Disclaimer: It is not possible for us to specify the name of the tool labelled as K[.*]io. Any potential similarity with an existing tool on the market is purely coincidental. We make no claims and take no responsibility for any interpretations that might arise.
9. Evaluation methodology
• Test all tools on the same data sample (DOIs) and capture the results
• Query all tools as if they were executed by the user
• Baseline method:
  • Collecting links from Crossref and crawling them to find full texts.
  • Calling CORE data via the CORE API (as a batch prior to execution)
• Evaluation metrics:
  • Hit rate: proportion of DOIs for which a URL is returned
  • Precision: proportion of true positives, i.e. correctly identified freely available article copy URLs, in the set of all returned URLs.
• Analysis of the returned results
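The two metrics above can be sketched in a few lines. This is a minimal illustration, not the evaluation code used in the study: the function names, the `responses` mapping (DOI to returned URL, or `None`) and the `verified` set of DOIs whose URL was judged correct are all hypothetical.

```python
from typing import Dict, Optional, Set


def hit_rate(responses: Dict[str, Optional[str]]) -> float:
    """Proportion of queried DOIs for which the tool returned any URL."""
    if not responses:
        return 0.0
    returned = sum(1 for url in responses.values() if url)
    return returned / len(responses)


def precision(responses: Dict[str, Optional[str]], verified: Set[str]) -> float:
    """Proportion of returned URLs that were judged correct (true positives)."""
    returned = [doi for doi, url in responses.items() if url]
    if not returned:
        return 0.0
    true_positives = sum(1 for doi in returned if doi in verified)
    return true_positives / len(returned)


# Hypothetical tool output for four DOIs (illustrative values only).
responses = {
    "10.1000/a": "https://example.org/a.pdf",
    "10.1000/b": None,
    "10.1000/c": "https://example.org/c.pdf",
    "10.1000/d": "https://example.org/d",
}
# DOIs whose returned URL was checked and found to be a free copy.
verified = {"10.1000/a", "10.1000/c"}

print(f"hit rate:  {hit_rate(responses):.2f}")
print(f"precision: {precision(responses, verified):.2f}")
```

Note that the two metrics have different denominators: hit rate is measured over all queried DOIs, precision only over the DOIs for which a URL came back.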
10. Data sample
• 100k DOIs randomly sampled from Crossref
• At a 99% confidence level, this gives a confidence interval of 0.41%, i.e. below 1%.
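The quoted interval can be reproduced with the standard margin-of-error formula for a proportion. The sketch below assumes the worst-case proportion p = 0.5 and ignores any finite-population correction; the slides do not state which formula was used, so treat this as one plausible reading rather than the authors' calculation.

```python
import math
from statistics import NormalDist


def margin_of_error(n: int, confidence: float = 0.99, p: float = 0.5) -> float:
    """Half-width of the confidence interval for a proportion.

    Assumes a simple random sample and the normal approximation;
    p = 0.5 gives the widest (most conservative) interval.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)


moe = margin_of_error(100_000)
print(f"{moe:.2%}")
```

Under these assumptions the result rounds to 0.41%, matching the figure on the slide.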
13. Precision
• Responding with a URL to a given DOI does not guarantee that the provided URL leads to a freely available version of the correct paper.
• We crawl all URLs returned by each tool and test whether:
  • the resource contains the string of the article’s title as recorded in Crossref,
  • the text of the resource is the full version of the content (difficult to automate).
• Limitation: the automated check overestimates precision (manual check needed)
• No major differences on the automated check
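A title check of this kind might look roughly as follows. This is a sketch under our own assumptions (case folding, punctuation stripped, whitespace collapsed); the slides do not describe the normalisation actually applied, and the function names are hypothetical.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Case-fold, replace punctuation with spaces and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def mentions_title(page_text: str, crossref_title: str) -> bool:
    """True if the normalised Crossref title occurs in the normalised page text."""
    title = normalise(crossref_title)
    return bool(title) and title in normalise(page_text)


page = "Analysing the Performance of Open-Access Papers: Discovery Tools\nFull text follows"
print(mentions_title(page, "Analysing the performance of open access papers discovery tools"))
print(mentions_title("An unrelated landing page", "Analysing the performance of open access papers discovery tools"))
```

A check like this only confirms that the title appears somewhere in the resource, so an abstract page or a table of contents would also pass, which is one reason an automated check of this kind overestimates precision.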
14. Are some tools better for some disciplines?
No significant differences across disciplines
16. What hit rate can be achieved if tools are combined?
We can improve hit rate by combining the outputs from multiple discovery tools.
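Combining tools amounts to taking the union of the DOIs each tool resolves. The sketch below uses made-up DOIs and tool names ("tool_a", "tool_b") purely for illustration; it does not reproduce any figure from the study.

```python
from typing import Dict, List, Set


def combined_hit_rate(sample: List[str], found_by_tool: Dict[str, Set[str]]) -> float:
    """Hit rate when a DOI counts as found if any tool returned a URL for it."""
    if not sample:
        return 0.0
    covered = set().union(*found_by_tool.values()) & set(sample)
    return len(covered) / len(sample)


# Hypothetical sample of five DOIs and the subsets each tool resolved.
sample = ["10.1000/1", "10.1000/2", "10.1000/3", "10.1000/4", "10.1000/5"]
found_by_tool = {
    "tool_a": {"10.1000/1", "10.1000/2"},
    "tool_b": {"10.1000/2", "10.1000/3"},
}

print(combined_hit_rate(sample, found_by_tool))
```

The combined hit rate can never be lower than that of the best single tool, but it says nothing about precision, which has to be checked separately for the merged output.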
17. Introducing CORE Discovery
• High coverage of freely available content
• Free service for researchers by researchers. No company controlling the pipes.
• Best grip on open repository content.
• Repository integration
• Discovering documents without a DOI.
https://core.ac.uk/services/discovery/
18. How CORE Discovery works
• Run a process on a big data cluster that merges data from MAG, Crossref and Unpaywall (2018 dump) with CORE to find free links in advance.
• Crawl the provided links to find full texts.
• If not found, call EPMC.
• Originally started with:
  • OA Button: increased hit rate but significantly decreased precision. Based on a manual check, ~32.59% of the links discovered by OA Button but not by CORE Discovery or Unpaywall were wrong.
  • K[.*]io: the possibility to call its API was removed early in 2019. It is also not used in CORE Discovery because of doubts regarding the delivery of many ResearchGate URL links.
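The lookup order described above (precomputed links, then a crawl check, then an external fallback) can be sketched as below. Everything here is hypothetical: `precomputed` stands in for the links merged in advance, `has_full_text` for the crawler's check, and `fallback` for the EPMC call. None of these names come from CORE Discovery itself.

```python
from typing import Callable, Dict, List, Optional


def discover(
    doi: str,
    precomputed: Dict[str, List[str]],
    has_full_text: Callable[[str], bool],
    fallback: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the first candidate URL that yields a full text, else try the fallback."""
    # Step 1: candidate links gathered in advance (assumed to be a DOI -> URLs mapping).
    for url in precomputed.get(doi, []):
        # Step 2: keep only links where a crawl confirms a full text.
        if has_full_text(url):
            return url
    # Step 3: nothing usable found, so ask the external service.
    return fallback(doi)


# Stub data and stub services, for illustration only.
precomputed = {"10.1000/a": ["https://example.org/landing", "https://example.org/a.pdf"]}
is_pdf = lambda url: url.endswith(".pdf")
no_fallback = lambda doi: None

print(discover("10.1000/a", precomputed, is_pdf, no_fallback))
print(discover("10.1000/zzz", precomputed, is_pdf, no_fallback))
```

Passing the crawler check and the fallback in as functions keeps the sketch free of network calls and makes it easy to swap in or drop an external source.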
20. Performance of CORE Discovery: hit rate
• 10k random sample from Crossref.
              CORE Discovery   Unpaywall
Not found     7374             7474
Discovered    2626             2526
Hit rate      26.26%           25.26%
21. Performance of CORE Discovery
• Manually checked 200 responses where CORE Discovery and Unpaywall both returned a URL.
• Precision:
  • CORE Discovery: 95.94%
  • Unpaywall: 93.40%
Response type                 CORE      Unpaywall
Display page with PDF link    9.64%     5.08%
HTML                          7.61%     7.61%
HTML + PDF                    3.55%     1.52%
PDF                           70.56%    78.17%
PDF in another language       1.02%     1.02%
TOC link                      3.55%     0.00%
Dead link                     0.51%     0.51%
HTML (abstract only)          0.51%     0.51%
DOI not detected              1.02%     3.55%
Wrong                         1.02%     1.02%
Wrong PDF                     1.02%     1.02%
Correct                       95.94%    93.40%
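As a consistency check on the table, the "Correct" row appears to be the sum of the first six response types. That grouping is our inference from the totals, not something stated on the slide. A short sketch:

```python
# Percentages copied from the table: (CORE, Unpaywall).
# Assumption: these six categories are the ones counted as "Correct".
correct_categories = {
    "Display page with PDF link": (9.64, 5.08),
    "HTML": (7.61, 7.61),
    "HTML + PDF": (3.55, 1.52),
    "PDF": (70.56, 78.17),
    "PDF in another language": (1.02, 1.02),
    "TOC link": (3.55, 0.00),
}

core_total = round(sum(core for core, _ in correct_categories.values()), 2)
unpaywall_total = round(sum(unp for _, unp in correct_categories.values()), 2)

print(core_total, unpaywall_total)
```

Both sums agree with the reported precision values to within the rounding of the individual rows.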
22. CORE Discovery Repository integration
• The majority of articles in repositories are metadata only.
• CORE Discovery repository plugin:
  • turns dead ends of user journeys into journeys fulfilling users’ information needs
  • makes repository content more discoverable.
23. Conclusions
• First study to quantitatively analyse the performance of OA discovery systems
• We identified:
  • Significant differences in the way OA discovery systems operate
  • Strategies that are successful
  • Potential for further improvement
• We developed CORE Discovery, which offers one-click access to free copies of research papers whenever you hit a paywall.
• Install the CORE Discovery browser extension and/or our repository plugin.
24. Acknowledgements
Feedback: CORE Ambassadors, KMI staff, UK Repository Managers
Lucas Anastasiou
Viktor Yakubiv Harriett Cornish Sergei Misak Nancy Pontika Svetlana Rumyanceva
Samuel Pearce Balviar Notay Chris Biggs Alan Stiles
Highest coverage of freely available content. Our tests have shown CORE Discovery finding more free content than any other discovery system.
Free service for researchers by researchers. CORE Discovery is the only free content discovery extension developed by researchers for researchers. There is no major publisher or enterprise controlling and profiting from your usage data.
Best grip on open repository content. Due to CORE being a leader in harvesting open access literature, CORE Discovery has the best grip on open content from open repositories as opposed to other services that disproportionately focus only on content indexed in major commercial databases.
Repository integration and discovering documents without a DOI. The only service offering seamless and free integration into repositories. CORE Discovery is also the only discovery system that can locate scientific content even for items with an unknown DOI or which do not have a DOI.