Analysing the performance of open
access papers discovery tools
Petr Knoth
Matteo Cancellieri
June 13, 2019 – OR 2019, Hamburg, Germany
CORE
Big Scientific Data and Text Analytics Group
Knowledge Media Institute, The Open University
Why Open Access (OA) Discovery?
• Automating the process of finding a full text of a research paper
• Identifying free copies of paywalled papers
• Reducing the access process to just one-click
• Analysis and monitoring of OA, subscriptions negotiation
• Discovery tools: Browser extensions, system integrations, public
APIs
Search vs Discovery
• Search: Given a query, find relevant papers
• Discovery: Given a document identifier(s), give me the full text
Aims of this work
In scope:
• Quantitatively compare and evaluate OA discovery tools using
widely established information retrieval metrics
• Identify gaps for improvement of OA discovery tools
• Design a tool that maximises performance (CORE Discovery)
Not in scope:
• Discovery beyond freely available content
• Illegal tools
What are OA discovery systems?
Task definition: Given a document identifier
(DOI), give me the URL of a freely accessible
version of the document.
Most successful OA discovery methods
1. Using Crossref as a primary data source and systematically
crawling full text based on Crossref links or other information.
2. Calling a wide range of external APIs in real-time.
We implemented method 1 as a baseline (plus a call to CORE in
advance) to understand to what extent the available methods
are better.
How do OA discovery systems work?
Unpaywall OA Button K[.*]io Baseline
Aims to find freely available copies of articles
Help subscribed users access non-OA content
Enriches data to obtain more OA links than already
provided by underlying infrastructures
Builds a database of DOI -> URL mappings
Calls external infrastructure services while serving
users’ request
Disclaimer: It is not possible for us to specify the name of the tool labelled as K[.*]io. Any potential similarity with an
existing tool on the market is purely coincidental. We make no claims and take no responsibility for any interpretations
that might arise.
Reliance on other infrastructures
Evaluation methodology
• Test all tools on the same data sample (DOIs) and capture the
result
• Query all tools as if they were executed by the user
• Baseline method:
• Collecting links from Crossref and crawling them to find full texts.
• Calling CORE data via CORE API (as a batch prior to execution)
• Evaluation metrics:
• Hit rate - proportion of DOIs for which a URL is returned
• Precision - proportion of true positives, i.e. correctly identified freely
available article copy URLs, in the set of all returned URLs.
• Analysis of the returned results
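The two metrics above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the DOIs and the URLs are hypothetical, and the correctness judgements are supplied by hand rather than produced by the study's crawler.

```python
from typing import Dict, Optional


def hit_rate(results: Dict[str, Optional[str]]) -> float:
    """Proportion of queried DOIs for which the tool returned any URL."""
    if not results:
        return 0.0
    returned = sum(1 for url in results.values() if url)
    return returned / len(results)


def precision(results: Dict[str, Optional[str]],
              is_correct: Dict[str, bool]) -> float:
    """Proportion of returned URLs that were judged to be correct."""
    returned = [doi for doi, url in results.items() if url]
    if not returned:
        return 0.0
    correct = sum(1 for doi in returned if is_correct.get(doi, False))
    return correct / len(returned)


# Hypothetical toy data: DOI -> URL returned by one tool (None = no answer).
results = {
    "10.1000/a": "https://example.org/a.pdf",
    "10.1000/b": None,
    "10.1000/c": "https://example.org/c",
    "10.1000/d": None,
}
# Hypothetical judgements for the URLs that were returned.
judgements = {"10.1000/a": True, "10.1000/c": False}

print(hit_rate(results))               # 2 of 4 DOIs answered
print(precision(results, judgements))  # 1 of 2 returned URLs correct
```

Note that precision is computed only over the DOIs for which a URL came back, so a tool can score a high precision with a low hit rate, and vice versa.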
Data sample
• 100k sample of DOIs randomly sampled from Crossref
• At a 99% confidence level, this gives a confidence interval of
0.41%, i.e. below 1%.
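The 0.41% figure is consistent with the standard margin-of-error formula for a proportion. The slides do not say how it was computed, so the sketch below assumes the usual normal approximation with the worst-case proportion p = 0.5, z = 2.576 for 99% confidence, and no finite-population correction.

```python
import math


def margin_of_error(n: int, z: float = 2.576, p: float = 0.5) -> float:
    """Normal-approximation margin of error for a proportion.

    z = 2.576 corresponds to a 99% confidence level; p = 0.5 is the
    worst case (largest variance). Both are assumptions, not values
    stated in the slides.
    """
    return z * math.sqrt(p * (1 - p) / n)


moe = margin_of_error(100_000)
print(f"{moe:.2%}")  # prints 0.41%
```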
Hit rate
Hit rate with respect to paper publication year
Precision
• Responding with a URL to a given DOI does not guarantee that
the provided URL leads to a freely available version of the
correct paper.
• We crawl all URLs returned by each tool and test whether:
• the resource contains the article's title string as recorded in Crossref,
• the text of the resource is the full version of the content
(difficult to automate).
Limitation: overestimates precision (manual check needed)
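A minimal sketch of the automated title check, assuming the page text has already been extracted from the crawled resource. The normalisation rules and function names are illustrative assumptions, not the study's implementation.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Lowercase, strip accents and collapse punctuation/whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def contains_title(page_text: str, crossref_title: str) -> bool:
    """True if the normalised Crossref title occurs in the page text."""
    title = normalise(crossref_title)
    return bool(title) and title in normalise(page_text)


# Hypothetical example: line breaks and hyphenation in the page text
# should not prevent a match.
page = "Analysing the Performance of Open-Access\nPapers Discovery Tools. Abstract ..."
title = "Analysing the performance of open access papers discovery tools"
print(contains_title(page, title))
```

A match only shows that the title appears somewhere in the resource. An abstract page or a table of contents would also pass, which is why this kind of check overestimates precision.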
No major differences on the automated check
Are some tools better for some disciplines?
No significant differences across disciplines
Pairwise overlap of the returned URLs
Overlap lower than expected
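The slides do not define how overlap was measured. One plausible reading, sketched here as an assumption, is the Jaccard overlap of the (DOI, URL) pairs returned by each pair of tools. Tool names and data below are hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple


def pairwise_overlap(
    tool_urls: Dict[str, Dict[str, Optional[str]]]
) -> Dict[Tuple[str, str], float]:
    """Jaccard overlap of returned (DOI, URL) pairs for every pair of tools."""
    overlaps = {}
    for a, b in combinations(sorted(tool_urls), 2):
        pairs_a = {(doi, url) for doi, url in tool_urls[a].items() if url}
        pairs_b = {(doi, url) for doi, url in tool_urls[b].items() if url}
        union = pairs_a | pairs_b
        overlaps[(a, b)] = len(pairs_a & pairs_b) / len(union) if union else 0.0
    return overlaps


# Hypothetical results from two tools.
tool_urls = {
    "tool_x": {"10.1000/a": "https://example.org/a.pdf",
               "10.1000/b": "https://example.org/b.pdf"},
    "tool_y": {"10.1000/a": "https://example.org/a.pdf",
               "10.1000/b": "https://repo.example.net/b",
               "10.1000/c": "https://repo.example.net/c"},
}
print(pairwise_overlap(tool_urls))
```

Exact string comparison treats two different URLs for the same paper as non-overlapping, so a low overlap under this definition may partly reflect tools pointing at different copies.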
What hit rate can be achieved if tools are combined?
We can improve hit rate by combining the outputs from multiple
discovery tools.
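The combined hit rate can be sketched as the proportion of DOIs for which at least one tool returns a URL. The function name and the toy data are hypothetical.

```python
from typing import Dict, Iterable, Optional


def combined_hit_rate(dois: Iterable[str],
                      tool_urls: Dict[str, Dict[str, Optional[str]]]) -> float:
    """Proportion of DOIs for which at least one tool returned a URL."""
    dois = list(dois)
    if not dois:
        return 0.0
    hits = sum(
        1 for doi in dois
        if any(urls.get(doi) for urls in tool_urls.values())
    )
    return hits / len(dois)


# Hypothetical sample of four DOIs; each tool answers a different one.
sample = ["10.1000/a", "10.1000/b", "10.1000/c", "10.1000/d"]
tools = {
    "tool_x": {"10.1000/a": "https://example.org/a.pdf"},
    "tool_y": {"10.1000/b": "https://repo.example.net/b"},
}
print(combined_hit_rate(sample, tools))  # each tool alone: 0.25
```

The gain from combining depends on how little the tools overlap: two tools that answer exactly the same DOIs add nothing to each other.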
Introducing CORE Discovery
• High coverage of freely available content
• Free service for researchers by researchers. No company
controlling the pipes.
• Best grip on open repository content.
• Repository integration
• Discovering documents without a DOI.
https://core.ac.uk/services/discovery/
How CORE Discovery works
• Run a process on a big data cluster that merges data from MAG,
Crossref and Unpaywall (2018 dump) with CORE to find free links
in advance.
• Crawling provided links to find full texts.
• If not found, calling EPMC.
• Originally started with:
• OA Button: increased hit rate but significantly decreased precision.
~32.59% of the links discovered by OA Button that were not discovered
by either CORE Discovery or Unpaywall were wrong, based on a manual
check.
• K[.*]io: removed the possibility to call its API early in 2019. Also not used in
CORE Discovery because of doubts regarding the delivery of many
ResearchGate URL links.
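The lookup order described above (precomputed links first, crawl to confirm a full text, then an external fallback) can be sketched as follows. Everything here is hypothetical: the function names, the stub full-text check and the stub fallback are placeholders and do not reflect CORE Discovery's code or any real API.

```python
from typing import Callable, Dict, List, Optional


def discover(doi: str,
             precomputed_links: Dict[str, List[str]],
             has_full_text: Callable[[str], bool],
             fallback_lookup: Callable[[str], Optional[str]]) -> Optional[str]:
    """Return a URL to a free full text for the DOI, or None."""
    # 1. Candidate links merged in advance from the bulk data sources.
    for url in precomputed_links.get(doi, []):
        # 2. Keep a candidate only if crawling confirms a full text.
        if has_full_text(url):
            return url
    # 3. Nothing confirmed: ask the external fallback service.
    return fallback_lookup(doi)


# Hypothetical stubs standing in for the crawler and the fallback call.
def stub_has_full_text(url: str) -> bool:
    return url.endswith(".pdf")


def stub_fallback(doi: str) -> Optional[str]:
    return "https://example.org/fallback/b" if doi == "10.1000/b" else None


links = {"10.1000/a": ["https://example.org/landing",
                       "https://example.org/a.pdf"]}

print(discover("10.1000/a", links, stub_has_full_text, stub_fallback))
print(discover("10.1000/b", links, stub_has_full_text, stub_fallback))
print(discover("10.1000/c", links, stub_has_full_text, stub_fallback))
```

Passing the crawler check and the fallback in as callables keeps the ordering logic testable without any network access.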
CORE Discovery demonstration
Performance of CORE Discovery: hit rate
• 10k random sample from Crossref.
              CORE Discovery   Unpaywall
Not found     7374             7474
Discovered    2626             2526
Hit rate      26.26%           25.26%
Performance of CORE Discovery
• Manually checked 200 responses where CORE Discovery and
Unpaywall both returned a URL.
• Precision:
• CORE Discovery: 95.94%
• Unpaywall: 93.40%
                             CORE      Unpaywall
Display page with PDF link   9.64%     5.08%
HTML                         7.61%     7.61%
HTML + PDF                   3.55%     1.52%
PDF                          70.56%    78.17%
PDF in another language      1.02%     1.02%
TOC link                     3.55%     0.00%
Dead link                    0.51%     0.51%
HTML (abstract only)         0.51%     0.51%
DOI not detected             1.02%     3.55%
Wrong                        1.02%     1.02%
Wrong PDF                    1.02%     1.02%
Correct                      95.94%    93.40%
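The "Correct" row appears to be the sum of the first six categories. That grouping is an inference from the totals, not something the slides state. A quick check of the arithmetic:

```python
# Percentages copied from the table: (CORE, Unpaywall).
table = {
    "Display page with PDF link": (9.64, 5.08),
    "HTML": (7.61, 7.61),
    "HTML + PDF": (3.55, 1.52),
    "PDF": (70.56, 78.17),
    "PDF in another language": (1.02, 1.02),
    "TOC link": (3.55, 0.00),
    "Dead link": (0.51, 0.51),
    "HTML (abstract only)": (0.51, 0.51),
    "DOI not detected": (1.02, 3.55),
    "Wrong": (1.02, 1.02),
    "Wrong PDF": (1.02, 1.02),
}

# Assumption: these categories are the ones counted as correct.
CORRECT = [
    "Display page with PDF link", "HTML", "HTML + PDF",
    "PDF", "PDF in another language", "TOC link",
]

core_correct = round(sum(table[c][0] for c in CORRECT), 2)
unpaywall_correct = round(sum(table[c][1] for c in CORRECT), 2)
print(core_correct, unpaywall_correct)
```

Under this grouping the sums come to 95.93 and 93.40, which match the reported 95.94% and 93.40% up to rounding of the individual rows.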
CORE Discovery Repository integration
• The majority of articles in repositories are metadata only.
• CORE Discovery repository plugin:
• turns dead ends of user journeys into journeys fulfilling
users’ information needs
• makes repository content more discoverable.
Conclusions
• First study to quantitatively analyse the performance of OA
discovery systems
• We identified:
• Significant differences in the way OA discovery systems operate.
• Strategies that are successful
• Potential for further improvement
• We developed CORE Discovery which offers one-click access
to free copies of research papers whenever you hit the paywall.
• Install CORE Discovery browser extension and/or our repository
plugin.
Acknowledgements
Feedback: CORE Ambassadors, KMI staff, UK Repository
Managers
Lucas Anastasiou
Viktor Yakubiv Harriett Cornish Sergei Misak Nancy Pontika Svetlana Rumyanceva
Samuel Pearce Balviar Notay Chris Biggs Alan Stiles
Thank you!
https://core.ac.uk/services/discovery

Mais conteúdo relacionado

Mais procurados

UKSG Conference 2017 Breakout - KBART recommendations: challenges and achieve...
UKSG Conference 2017 Breakout - KBART recommendations: challenges and achieve...UKSG Conference 2017 Breakout - KBART recommendations: challenges and achieve...
UKSG Conference 2017 Breakout - KBART recommendations: challenges and achieve...UKSG: connecting the knowledge community
 
ICIC 2013 Conference Proceedings Uwe Rosemann TIB
ICIC 2013 Conference Proceedings Uwe Rosemann TIBICIC 2013 Conference Proceedings Uwe Rosemann TIB
ICIC 2013 Conference Proceedings Uwe Rosemann TIBDr. Haxel Consult
 
Establishing the Connection: Creating a Linked Data Version of the BNB
Establishing the Connection: Creating a Linked Data Version of the BNBEstablishing the Connection: Creating a Linked Data Version of the BNB
Establishing the Connection: Creating a Linked Data Version of the BNBnw13
 
Reference linking and Cited-by
Reference linking and Cited-byReference linking and Cited-by
Reference linking and Cited-byCrossref
 
Using OpenUrl Activity Data Summary for RDTF Day 26 May 11
Using OpenUrl Activity Data Summary for RDTF Day 26 May 11Using OpenUrl Activity Data Summary for RDTF Day 26 May 11
Using OpenUrl Activity Data Summary for RDTF Day 26 May 11EDINA, University of Edinburgh
 
Data Stories: Using Narratives to Reflect on a Data Purchase Pilot Program
Data Stories: Using Narratives to Reflect on a Data Purchase Pilot ProgramData Stories: Using Narratives to Reflect on a Data Purchase Pilot Program
Data Stories: Using Narratives to Reflect on a Data Purchase Pilot ProgramNASIG
 
The Global reach of Crossref metadata
The Global reach of Crossref metadataThe Global reach of Crossref metadata
The Global reach of Crossref metadataCrossref
 
محاضرة برنامج Endnote لتبويب المراجع العلمية د.غادة باوزير
محاضرة برنامج Endnote لتبويب المراجع العلمية د.غادة باوزيرمحاضرة برنامج Endnote لتبويب المراجع العلمية د.غادة باوزير
محاضرة برنامج Endnote لتبويب المراجع العلمية د.غادة باوزيرمركز البحوث الأقسام العلمية
 
20190527_Karen Hytteballe Ibanez _ The OPERA project
 20190527_Karen Hytteballe Ibanez _ The OPERA project 20190527_Karen Hytteballe Ibanez _ The OPERA project
20190527_Karen Hytteballe Ibanez _ The OPERA projectOpenAIRE
 
CrossRef Text & Data Mining - UKSG 2015
CrossRef Text & Data Mining - UKSG 2015CrossRef Text & Data Mining - UKSG 2015
CrossRef Text & Data Mining - UKSG 2015Crossref
 
Using Linked Data Resources to generate web pages based on a BBC case study
Using Linked Data Resources to generate web pages based on a BBC case studyUsing Linked Data Resources to generate web pages based on a BBC case study
Using Linked Data Resources to generate web pages based on a BBC case studyLeila Zemmouchi-Ghomari
 
Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...petrknoth
 
Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...petrknoth
 
Web scale discovery vs google scholar
Web scale discovery vs google scholarWeb scale discovery vs google scholar
Web scale discovery vs google scholarNikesh Narayanan
 
Collecting and using funding data in your publications
Collecting and using funding data in your publicationsCollecting and using funding data in your publications
Collecting and using funding data in your publicationsCrossref
 
Web Scale Discovery Vs Federated Search
Web Scale Discovery Vs Federated SearchWeb Scale Discovery Vs Federated Search
Web Scale Discovery Vs Federated SearchNikesh Narayanan
 

Mais procurados (20)

UKSG Conference 2017 Breakout - KBART recommendations: challenges and achieve...
UKSG Conference 2017 Breakout - KBART recommendations: challenges and achieve...UKSG Conference 2017 Breakout - KBART recommendations: challenges and achieve...
UKSG Conference 2017 Breakout - KBART recommendations: challenges and achieve...
 
ICIC 2013 Conference Proceedings Uwe Rosemann TIB
ICIC 2013 Conference Proceedings Uwe Rosemann TIBICIC 2013 Conference Proceedings Uwe Rosemann TIB
ICIC 2013 Conference Proceedings Uwe Rosemann TIB
 
Establishing the Connection: Creating a Linked Data Version of the BNB
Establishing the Connection: Creating a Linked Data Version of the BNBEstablishing the Connection: Creating a Linked Data Version of the BNB
Establishing the Connection: Creating a Linked Data Version of the BNB
 
Reference linking and Cited-by
Reference linking and Cited-byReference linking and Cited-by
Reference linking and Cited-by
 
Using OpenUrl Activity Data Summary for RDTF Day 26 May 11
Using OpenUrl Activity Data Summary for RDTF Day 26 May 11Using OpenUrl Activity Data Summary for RDTF Day 26 May 11
Using OpenUrl Activity Data Summary for RDTF Day 26 May 11
 
Data Stories: Using Narratives to Reflect on a Data Purchase Pilot Program
Data Stories: Using Narratives to Reflect on a Data Purchase Pilot ProgramData Stories: Using Narratives to Reflect on a Data Purchase Pilot Program
Data Stories: Using Narratives to Reflect on a Data Purchase Pilot Program
 
The Global reach of Crossref metadata
The Global reach of Crossref metadataThe Global reach of Crossref metadata
The Global reach of Crossref metadata
 
Content Liberation! How Increasing the Institutional Repository Content Turne...
Content Liberation! How Increasing the Institutional Repository Content Turne...Content Liberation! How Increasing the Institutional Repository Content Turne...
Content Liberation! How Increasing the Institutional Repository Content Turne...
 
محاضرة برنامج Endnote لتبويب المراجع العلمية د.غادة باوزير
محاضرة برنامج Endnote لتبويب المراجع العلمية د.غادة باوزيرمحاضرة برنامج Endnote لتبويب المراجع العلمية د.غادة باوزير
محاضرة برنامج Endnote لتبويب المراجع العلمية د.غادة باوزير
 
20190527_Karen Hytteballe Ibanez _ The OPERA project
 20190527_Karen Hytteballe Ibanez _ The OPERA project 20190527_Karen Hytteballe Ibanez _ The OPERA project
20190527_Karen Hytteballe Ibanez _ The OPERA project
 
لتحليل الدراسات السابقة Nails محاضرة برنامج
  لتحليل الدراسات السابقة Nails محاضرة برنامج  لتحليل الدراسات السابقة Nails محاضرة برنامج
لتحليل الدراسات السابقة Nails محاضرة برنامج
 
CrossRef Text & Data Mining - UKSG 2015
CrossRef Text & Data Mining - UKSG 2015CrossRef Text & Data Mining - UKSG 2015
CrossRef Text & Data Mining - UKSG 2015
 
Using Linked Data Resources to generate web pages based on a BBC case study
Using Linked Data Resources to generate web pages based on a BBC case studyUsing Linked Data Resources to generate web pages based on a BBC case study
Using Linked Data Resources to generate web pages based on a BBC case study
 
Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...
 
Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...
 
Web scale discovery vs google scholar
Web scale discovery vs google scholarWeb scale discovery vs google scholar
Web scale discovery vs google scholar
 
Collecting and using funding data in your publications
Collecting and using funding data in your publicationsCollecting and using funding data in your publications
Collecting and using funding data in your publications
 
5-Cited-by-linking
5-Cited-by-linking5-Cited-by-linking
5-Cited-by-linking
 
Web Scale Discovery Vs Federated Search
Web Scale Discovery Vs Federated SearchWeb Scale Discovery Vs Federated Search
Web Scale Discovery Vs Federated Search
 
CEK KEMIRIPAN PADA CROSSREF
CEK KEMIRIPAN PADA CROSSREFCEK KEMIRIPAN PADA CROSSREF
CEK KEMIRIPAN PADA CROSSREF
 

Semelhante a Analysing the performance of open access papers discovery tools

Metadata, Open Access and More: Crossref presentation
Metadata, Open Access and More: Crossref presentationMetadata, Open Access and More: Crossref presentation
Metadata, Open Access and More: Crossref presentationCrossref
 
Content Registration at Crossref - LIVE Kuala Lumpur
Content Registration at Crossref - LIVE Kuala LumpurContent Registration at Crossref - LIVE Kuala Lumpur
Content Registration at Crossref - LIVE Kuala LumpurCrossref
 
Evaluation of web scale discovery services
Evaluation of web scale discovery servicesEvaluation of web scale discovery services
Evaluation of web scale discovery servicesNikesh Narayanan
 
New member
New member New member
New member Crossref
 
Working with Crossref and registering content
Working with Crossref and registering contentWorking with Crossref and registering content
Working with Crossref and registering contentCrossref
 
Towards effective research recommender systems for repositories
Towards effective research recommender systems for repositoriesTowards effective research recommender systems for repositories
Towards effective research recommender systems for repositoriespetrknoth
 
Content Registration, Crossref ALJEBI, Indonesia
Content Registration, Crossref ALJEBI, IndonesiaContent Registration, Crossref ALJEBI, Indonesia
Content Registration, Crossref ALJEBI, IndonesiaCrossref
 
New member webinar 052418
New member webinar 052418New member webinar 052418
New member webinar 052418Crossref
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...S. Diana Hu
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...Joaquin Delgado PhD.
 
Who is using your content?
Who is using your content? Who is using your content?
Who is using your content? Crossref
 
Crossref Content Registration - LIVE Mumbai
Crossref Content Registration - LIVE MumbaiCrossref Content Registration - LIVE Mumbai
Crossref Content Registration - LIVE MumbaiCrossref
 
Evaluating the Quality of OpenURLs Through Analytics (TLA 2012)
Evaluating the Quality of OpenURLs Through Analytics (TLA 2012)Evaluating the Quality of OpenURLs Through Analytics (TLA 2012)
Evaluating the Quality of OpenURLs Through Analytics (TLA 2012)Rafal Kasprowski
 
Crossref LIVE US Online
Crossref LIVE US OnlineCrossref LIVE US Online
Crossref LIVE US OnlineCrossref
 
K3 edith falk_discoverytoolslibrary
K3 edith falk_discoverytoolslibraryK3 edith falk_discoverytoolslibrary
K3 edith falk_discoverytoolslibraryevaminerva
 
Crossref Overview - Russian webinar
Crossref Overview - Russian webinar Crossref Overview - Russian webinar
Crossref Overview - Russian webinar Crossref
 
Web Scale Discovery Services: Google like search experience
Web Scale Discovery Services: Google like search experienceWeb Scale Discovery Services: Google like search experience
Web Scale Discovery Services: Google like search experienceNikesh Narayanan
 
Content Registration at Crossref - LIVE Bangkok
Content Registration at Crossref - LIVE BangkokContent Registration at Crossref - LIVE Bangkok
Content Registration at Crossref - LIVE BangkokCrossref
 
Crossref for Ambassadors - Introductory webinar
Crossref for Ambassadors - Introductory webinarCrossref for Ambassadors - Introductory webinar
Crossref for Ambassadors - Introductory webinarVanessa Fairhurst
 

Semelhante a Analysing the performance of open access papers discovery tools (20)

Metadata, Open Access and More: Crossref presentation
Metadata, Open Access and More: Crossref presentationMetadata, Open Access and More: Crossref presentation
Metadata, Open Access and More: Crossref presentation
 
Content Registration at Crossref - LIVE Kuala Lumpur
Content Registration at Crossref - LIVE Kuala LumpurContent Registration at Crossref - LIVE Kuala Lumpur
Content Registration at Crossref - LIVE Kuala Lumpur
 
Evaluation of web scale discovery services
Evaluation of web scale discovery servicesEvaluation of web scale discovery services
Evaluation of web scale discovery services
 
New member
New member New member
New member
 
APA ITU DOI?
APA ITU DOI?APA ITU DOI?
APA ITU DOI?
 
Working with Crossref and registering content
Working with Crossref and registering contentWorking with Crossref and registering content
Working with Crossref and registering content
 
Towards effective research recommender systems for repositories
Towards effective research recommender systems for repositoriesTowards effective research recommender systems for repositories
Towards effective research recommender systems for repositories
 
Content Registration, Crossref ALJEBI, Indonesia
Content Registration, Crossref ALJEBI, IndonesiaContent Registration, Crossref ALJEBI, Indonesia
Content Registration, Crossref ALJEBI, Indonesia
 
New member webinar 052418
New member webinar 052418New member webinar 052418
New member webinar 052418
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
Who is using your content?
Who is using your content? Who is using your content?
Who is using your content?
 
Crossref Content Registration - LIVE Mumbai
Crossref Content Registration - LIVE MumbaiCrossref Content Registration - LIVE Mumbai
Crossref Content Registration - LIVE Mumbai
 
Evaluating the Quality of OpenURLs Through Analytics (TLA 2012)
Evaluating the Quality of OpenURLs Through Analytics (TLA 2012)Evaluating the Quality of OpenURLs Through Analytics (TLA 2012)
Evaluating the Quality of OpenURLs Through Analytics (TLA 2012)
 
Crossref LIVE US Online
Crossref LIVE US OnlineCrossref LIVE US Online
Crossref LIVE US Online
 
K3 edith falk_discoverytoolslibrary
K3 edith falk_discoverytoolslibraryK3 edith falk_discoverytoolslibrary
K3 edith falk_discoverytoolslibrary
 
Crossref Overview - Russian webinar
Crossref Overview - Russian webinar Crossref Overview - Russian webinar
Crossref Overview - Russian webinar
 
Web Scale Discovery Services: Google like search experience
Web Scale Discovery Services: Google like search experienceWeb Scale Discovery Services: Google like search experience
Web Scale Discovery Services: Google like search experience
 
Content Registration at Crossref - LIVE Bangkok
Content Registration at Crossref - LIVE BangkokContent Registration at Crossref - LIVE Bangkok
Content Registration at Crossref - LIVE Bangkok
 
Crossref for Ambassadors - Introductory webinar
Crossref for Ambassadors - Introductory webinarCrossref for Ambassadors - Introductory webinar
Crossref for Ambassadors - Introductory webinar
 

Mais de petrknoth

Qui Bono? Cumulative advantage in open access publishing
Qui Bono? Cumulative advantage in open access publishingQui Bono? Cumulative advantage in open access publishing
Qui Bono? Cumulative advantage in open access publishingpetrknoth
 
OAI Identifiers: Decentralised PIDs for Research Outputs in Repositories
OAI Identifiers: Decentralised PIDs for Research Outputs in RepositoriesOAI Identifiers: Decentralised PIDs for Research Outputs in Repositories
OAI Identifiers: Decentralised PIDs for Research Outputs in Repositoriespetrknoth
 
UKRI OA policy requirements for repositories and how to meet them
UKRI OA policy requirements for repositories and how to meet themUKRI OA policy requirements for repositories and how to meet them
UKRI OA policy requirements for repositories and how to meet thempetrknoth
 
Enabling Educators to Locate High-Quality Teaching Resources
Enabling Educators to LocateHigh-Quality Teaching ResourcesEnabling Educators to LocateHigh-Quality Teaching Resources
Enabling Educators to Locate High-Quality Teaching Resourcespetrknoth
 
Data interoperability toolkit (OpenMinTeD)
Data interoperability toolkit (OpenMinTeD)Data interoperability toolkit (OpenMinTeD)
Data interoperability toolkit (OpenMinTeD)petrknoth
 
Integrating research indicators for use in the repositories infrastructure
Integrating research indicators for use in the repositories infrastructure Integrating research indicators for use in the repositories infrastructure
Integrating research indicators for use in the repositories infrastructure petrknoth
 
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...petrknoth
 
Seamless access to the world’s open access research papers via ResourceSync
Seamless access to the world’s open access research papers via ResourceSyncSeamless access to the world’s open access research papers via ResourceSync
Seamless access to the world’s open access research papers via ResourceSyncpetrknoth
 
Semantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research EvaluationSemantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research Evaluationpetrknoth
 
Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...petrknoth
 
My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?petrknoth
 
FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)petrknoth
 
Towards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific PublicationsTowards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific Publicationspetrknoth
 
From Open Access Metadata to Open Access Content: Two Principles for Increase...
From Open Access Metadata to Open Access Content: Two Principles for Increase...From Open Access Metadata to Open Access Content: Two Principles for Increase...
From Open Access Metadata to Open Access Content: Two Principles for Increase...petrknoth
 
DiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected RepositoriesDiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected Repositoriespetrknoth
 
DEVCSI Core Mobile
DEVCSI Core MobileDEVCSI Core Mobile
DEVCSI Core Mobilepetrknoth
 
Text mining in CORE (OR2012)
Text mining in CORE (OR2012)Text mining in CORE (OR2012)
Text mining in CORE (OR2012)petrknoth
 
CORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open AccessCORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open Accesspetrknoth
 
CORE projects family
CORE projects familyCORE projects family
CORE projects familypetrknoth
 

Mais de petrknoth (20)

Qui Bono? Cumulative advantage in open access publishing
Qui Bono? Cumulative advantage in open access publishingQui Bono? Cumulative advantage in open access publishing
Qui Bono? Cumulative advantage in open access publishing
 
CORE APIv3
CORE APIv3CORE APIv3
CORE APIv3
 
OAI Identifiers: Decentralised PIDs for Research Outputs in Repositories
OAI Identifiers: Decentralised PIDs for Research Outputs in RepositoriesOAI Identifiers: Decentralised PIDs for Research Outputs in Repositories
OAI Identifiers: Decentralised PIDs for Research Outputs in Repositories
 
UKRI OA policy requirements for repositories and how to meet them
UKRI OA policy requirements for repositories and how to meet themUKRI OA policy requirements for repositories and how to meet them
UKRI OA policy requirements for repositories and how to meet them
 
Enabling Educators to Locate High-Quality Teaching Resources
Enabling Educators to LocateHigh-Quality Teaching ResourcesEnabling Educators to LocateHigh-Quality Teaching Resources
Enabling Educators to Locate High-Quality Teaching Resources
 
Data interoperability toolkit (OpenMinTeD)
Data interoperability toolkit (OpenMinTeD)Data interoperability toolkit (OpenMinTeD)
Data interoperability toolkit (OpenMinTeD)
 
Integrating research indicators for use in the repositories infrastructure
Integrating research indicators for use in the repositories infrastructure Integrating research indicators for use in the repositories infrastructure
Integrating research indicators for use in the repositories infrastructure
 
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
 
Seamless access to the world’s open access research papers via ResourceSync
Seamless access to the world’s open access research papers via ResourceSyncSeamless access to the world’s open access research papers via ResourceSync
Seamless access to the world’s open access research papers via ResourceSync
 
Semantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research EvaluationSemantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research Evaluation
 
Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...
 
My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?
 
FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)
 
Towards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific PublicationsTowards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific Publications
 
From Open Access Metadata to Open Access Content: Two Principles for Increase...
From Open Access Metadata to Open Access Content: Two Principles for Increase...From Open Access Metadata to Open Access Content: Two Principles for Increase...
From Open Access Metadata to Open Access Content: Two Principles for Increase...
 
DiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected RepositoriesDiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected Repositories
 
DEVCSI Core Mobile
DEVCSI Core MobileDEVCSI Core Mobile
DEVCSI Core Mobile
 
Text mining in CORE (OR2012)
Text mining in CORE (OR2012)Text mining in CORE (OR2012)
Text mining in CORE (OR2012)
 
CORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open AccessCORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open Access
 
CORE projects family
CORE projects familyCORE projects family
CORE projects family
 

Analysing the performance of open access papers discovery tools

  • 1. Analysing the performance of open access papers discovery tools Petr Knoth Matteo Cancellieri June 13, 2019 – OR 2019, Hamburg, Germany CORE Big Scientific Data and Text Analytics Group Knowledge Media Institute, The Open University
  • 2. Why Open Access (OA) Discovery? • Automating the process of finding a full text of a research paper • Identifying free copies of paywalled papers • Reducing the access process to just one-click • Analysis and monitoring of OA, subscriptions negotiation • Discovery tools: Browser extensions, system integrations, public APIs
  • 3. Search vs Discovery • Search: Given a query, find relevant papers • Discovery: Given a document identifier(s), give me the full text
  • 4. Aims of this work In scope: • Quantitatively compare and evaluate OA discovery tools using widely established information retrieval metrics • Identify gaps for improvement of OA discovery tools • Design a tool that maximises performance (CORE Discovery) Not in scope: • Discovery beyond freely available content • Illegal tools
  • 5. What are OA discovery systems? Task definition: Given a document identifier (DOI), give me the URL of a freely accessible version of the document.
  • 6. Most successful OA discovery methods 1. Using Crossref as a primary data source and systematically crawling full text based on Crossref links or other information. 2. Calling a wide range of external APIs in real time. We implemented method 1 as a baseline (+ call to CORE in advance), to understand to what extent the available methods are better.
  • 7. How do OA discovery systems work?
      [Comparison table. Columns: Unpaywall, OA Button, K[.*]io, Baseline. Rows:]
      • Aims to find freely available copies of articles
      • Helps subscribed users access non-OA content
      • Enriches data to obtain more OA links than already provided by underlying infrastructures
      • Builds a database of DOI -> URL mappings
      • Calls external infrastructure services while serving users’ requests
      Disclaimer: It is not possible for us to specify the name of the tool labelled as K[.*]io. Any potential similarity with an existing tool on the market is purely coincidental. We make no claims and take no responsibility for any interpretations that might arise.
  • 8. Reliance on other infrastructures
  • 9. Evaluation methodology
      • Test all tools on the same data sample (DOIs) and capture the results.
      • Query all tools as if they were queried by a user.
      • Baseline method:
          • Collect links from Crossref and crawl them to find full texts.
          • Retrieve CORE data via the CORE API (as a batch prior to execution).
      • Evaluation metrics:
          • Hit rate: proportion of DOIs for which a URL is returned.
          • Precision: proportion of true positives, i.e. correctly identified freely available article copy URLs, in the set of all returned URLs.
      • Analysis of the returned results.
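The two metrics defined on slide 9 can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the `responses` mapping (DOI to returned URL or `None`) and the set of verified URLs are all assumptions made for the example, and the DOIs and URLs are made up.

```python
from typing import Dict, Optional, Set


def hit_rate(responses: Dict[str, Optional[str]]) -> float:
    """Proportion of queried DOIs for which the tool returned a URL."""
    if not responses:
        return 0.0
    returned = sum(1 for url in responses.values() if url)
    return returned / len(responses)


def precision(responses: Dict[str, Optional[str]], correct_urls: Set[str]) -> float:
    """Proportion of returned URLs that were verified as correct free copies."""
    returned = [url for url in responses.values() if url]
    if not returned:
        return 0.0
    true_positives = sum(1 for url in returned if url in correct_urls)
    return true_positives / len(returned)


# Hypothetical toy data: four DOIs queried, three URLs returned, two verified.
responses = {
    "10.1000/a": "https://example.org/a.pdf",
    "10.1000/b": None,
    "10.1000/c": "https://example.org/c-wrong.pdf",
    "10.1000/d": "https://example.org/d.pdf",
}
verified = {"https://example.org/a.pdf", "https://example.org/d.pdf"}

print(f"hit rate:  {hit_rate(responses):.2f}")
print(f"precision: {precision(responses, verified):.2f}")
```

Note that the denominators differ: hit rate is computed over all queried DOIs, while precision is computed only over the URLs that were actually returned.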
  • 10. Data sample • 100k DOIs randomly sampled from Crossref • At a 99% confidence level, this sample size gives a confidence interval of 0.41%, i.e. below 1%.
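The slide does not say how the 0.41% figure was derived. One common calculation that reproduces it is the margin of error for a proportion under the normal approximation, assuming the worst case p = 0.5 and no finite-population correction. The sketch below makes those assumptions explicit; the function name and the z-value constant are chosen for the example.

```python
import math


def margin_of_error(n: int, z: float, p: float = 0.5) -> float:
    """Half-width of a confidence interval for a proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)


Z_99 = 2.576  # standard normal critical value for a 99% confidence level

moe = margin_of_error(100_000, Z_99)
print(f"{moe:.2%}")  # 0.41%
```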
  • 12. Hit rate with respect to paper publication year
  • 13. Precision
      • Responding with a URL to a given DOI does not guarantee that the provided URL leads to a freely available version of the correct paper.
      • We crawl all URLs returned by each tool and test whether:
          • the resource contains the string of the article’s title as recorded in Crossref,
          • the text of the resource is the full version of the content (difficult to automate).
      • Limitation: the automated check overestimates precision (a manual check is needed).
      • Result: no major differences on the automated check.
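The automated title test from slide 13 could look roughly like the sketch below. It is an assumption-laden illustration: the slides do not describe how text is normalised, so the `normalise` and `contains_title` helpers (hypothetical names) simply lower-case the text, strip accents and collapse punctuation and whitespace before doing a substring match.

```python
import re
import unicodedata


def normalise(text: str) -> str:
    """Lower-case, strip accents, and collapse non-alphanumeric runs to single spaces."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def contains_title(resource_text: str, crossref_title: str) -> bool:
    """True if the normalised Crossref title occurs in the normalised resource text."""
    title = normalise(crossref_title)
    return bool(title) and title in normalise(resource_text)


page = "Analysing the Performance of\nOpen-Access Papers Discovery Tools. Abstract ..."
print(contains_title(page, "Analysing the performance of open access papers discovery tools"))
print(contains_title(page, "A completely different paper"))
```

A check like this only confirms that the title appears somewhere in the resource. A landing page or abstract page also passes, which is consistent with the limitation noted on the slide that the automated check overestimates precision.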
  • 14. Are some tools better for some disciplines? No significant differences across disciplines
  • 15. Pairwise overlap of the returned URLs Overlap lower than expected
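The slide does not define how overlap was measured. As one possible reading, the sketch below computes the Jaccard overlap between the sets of (DOI, URL) pairs returned by each pair of tools. The measure, the function names and the toy data are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Optional, Set, Tuple

Responses = Dict[str, Optional[str]]


def url_pairs(responses: Responses) -> Set[Tuple[str, str]]:
    """Set of (DOI, URL) pairs for which a URL was returned."""
    return {(doi, url) for doi, url in responses.items() if url}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(tools: Dict[str, Responses]) -> Dict[Tuple[str, str], float]:
    """Jaccard overlap of returned (DOI, URL) pairs for every pair of tools."""
    return {
        (x, y): jaccard(url_pairs(tools[x]), url_pairs(tools[y]))
        for x, y in combinations(sorted(tools), 2)
    }


# Hypothetical toy responses from two tools.
tools = {
    "tool_a": {"d1": "https://example.org/1", "d2": "https://example.org/2", "d3": None},
    "tool_b": {"d1": "https://example.org/1", "d2": "https://example.org/2x", "d3": "https://example.org/3"},
}
overlap = pairwise_overlap(tools)
print(overlap[("tool_a", "tool_b")])  # 0.25
```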
  • 16. What hit rate can be achieved if tools are combined? We can improve hit rate by combining the outputs from multiple discovery tools.
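Combining tools as on slide 16 amounts to counting a DOI as found when at least one tool returns a URL for it. The sketch below illustrates this under assumed names and toy data. It says nothing about precision, which the deck reports can drop when a lower-precision tool is added to the mix.

```python
from typing import Dict, Iterable, Optional

Responses = Dict[str, Optional[str]]


def combined_hit_rate(tools: Dict[str, Responses], dois: Iterable[str]) -> float:
    """Proportion of DOIs for which at least one tool returned a URL."""
    dois = list(dois)
    if not dois:
        return 0.0
    found = sum(
        1 for doi in dois if any(resp.get(doi) for resp in tools.values())
    )
    return found / len(dois)


# Hypothetical toy data: each tool alone finds 2 of 4 DOIs, together they find 3.
sample = ["d1", "d2", "d3", "d4"]
tool_outputs = {
    "tool_a": {"d1": "https://example.org/1", "d2": "https://example.org/2"},
    "tool_b": {"d2": "https://example.org/2", "d3": "https://example.org/3"},
}

print(combined_hit_rate({"tool_a": tool_outputs["tool_a"]}, sample))  # 0.5
print(combined_hit_rate(tool_outputs, sample))  # 0.75
```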
  • 17. Introducing CORE Discovery • High coverage of freely available content • Free service for researchers by researchers. No company controlling the pipes. • Best grip on open repository content. • Repository integration • Discovering documents without a DOI. https://core.ac.uk/services/discovery/
  • 18. How CORE Discovery works
      • Run a process on a big data cluster that merges data from MAG, Crossref and Unpaywall (2018 dump) and combines it with CORE to find free links in advance.
      • Crawl the provided links to find full texts.
      • If not found, call EPMC.
      • Originally started with:
          • OA Button: increased hit rate but significantly decreased precision. Based on a manual check, ~32.59% of the links discovered by OA Button that were not discovered by CORE Discovery and Unpaywall were wrong.
          • K[.*]io: removed the possibility to call its API early in 2019. Also not used in CORE Discovery because of doubts regarding the delivery of many ResearchGate URL links.
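The lookup order described on slide 18 (links computed in advance first, an external call only on a miss) can be sketched as below. This is not CORE Discovery's implementation: `discover`, the `precomputed` mapping and the `fallback` callable are hypothetical, and the fallback is a stand-in rather than a real EPMC client.

```python
from typing import Callable, Dict, Optional


def discover(
    doi: str,
    precomputed: Dict[str, str],
    fallback: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return a free full-text URL for a DOI, or None if nothing is found."""
    key = doi.strip().lower()  # DOI names are case-insensitive
    url = precomputed.get(key)
    if url:
        return url        # found in the mapping built in advance
    return fallback(key)  # only now pay for a real-time external call


# Hypothetical data and a fake fallback service used purely for illustration.
precomputed = {"10.1000/a": "https://example.org/a.pdf"}


def fake_external_lookup(doi: str) -> Optional[str]:
    return "https://example.org/b.pdf" if doi == "10.1000/b" else None


print(discover("10.1000/A", precomputed, fake_external_lookup))
print(discover("10.1000/b", precomputed, fake_external_lookup))
print(discover("10.1000/c", precomputed, fake_external_lookup))
```

Keeping the external call behind the precomputed lookup means most requests are answered without depending on another service at request time.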
  • 20. Performance of CORE Discovery: hit rate
      • 10k random sample from Crossref.

                      CORE Discovery   Unpaywall
        Not found     7374             7474
        Discovered    2626             2526
        Hit rate      26.26%           25.26%
  • 21. Performance of CORE Discovery
      • Manually checked 200 responses where CORE Discovery and Unpaywall both returned a URL.
      • Precision: CORE Discovery 95.94%, Unpaywall 93.40%.

        Result type                   CORE      Unpaywall
        Display page with PDF link    9.64%     5.08%
        HTML                          7.61%     7.61%
        HTML + PDF                    3.55%     1.52%
        PDF                           70.56%    78.17%
        PDF in another language       1.02%     1.02%
        TOC link                      3.55%     0.00%
        Dead link                     0.51%     0.51%
        HTML (abstract only)          0.51%     0.51%
        DOI not detected              1.02%     3.55%
        Wrong                         1.02%     1.02%
        Wrong PDF                     1.02%     1.02%
        Correct                       95.94%    93.40%
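As a quick arithmetic check on slide 21, the "Correct" row appears to be the sum of the first six result types. That grouping is an assumption inferred from the numbers, not something the slide states. The sketch below adds up the reported percentages.

```python
# Reported percentages per result type: (CORE, Unpaywall).
# Treating these six categories as "correct" is an assumption made for this check.
assumed_correct = {
    "Display page with PDF link": (9.64, 5.08),
    "HTML": (7.61, 7.61),
    "HTML + PDF": (3.55, 1.52),
    "PDF": (70.56, 78.17),
    "PDF in another language": (1.02, 1.02),
    "TOC link": (3.55, 0.00),
}

core_sum = sum(core for core, _ in assumed_correct.values())
unpaywall_sum = sum(unp for _, unp in assumed_correct.values())

print(f"CORE:      {core_sum:.2f}% (reported 95.94%)")
print(f"Unpaywall: {unpaywall_sum:.2f}% (reported 93.40%)")
```

Under this grouping the sums agree with the reported precision values to within rounding of the individual rows.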
  • 22. CORE Discovery repository integration • The majority of articles in repositories are metadata only. • CORE Discovery repository plugin: • turns dead ends of user journeys into journeys fulfilling users’ information needs • makes repository content more discoverable.
  • 23. Conclusions • First study to quantitatively analyse the performance of OA discovery systems • We identified: • Significant differences in the way OA discovery systems operate. • Strategies that are successful • Potential for further improvement • We developed CORE Discovery which offers one-click access to free copies of research papers whenever you hit the paywall. • Install CORE Discovery browser extension and/or our repository plugin.
  • 24. Acknowledgements Feedback: CORE Ambassadors, KMI staff, UK Repository Managers Lucas Anastasiou Viktor Yakubiv Harriett Cornish Sergei Misak Nancy Pontika Svetlana Rumyanceva Samuel Pearce Balviar Notay Chris Biggs Alan Stiles

Editor's Notes

  1. Highest coverage of freely available content. Our tests have shown CORE Discovery finding more free content than any other discovery system. Free service for researchers by researchers. CORE Discovery is the only free content discovery extension developed by researchers for researchers. There is no major publisher or enterprise controlling and profiting from your usage data. Best grip on open repository content. Due to CORE being a leader in harvesting open access literature, CORE Discovery has the best grip on open content from open repositories as opposed to other services that disproportionately focus only on content indexed in major commercial databases. Repository integration and discovering documents without a DOI. The only service offering seamless and free integration into repositories. CORE Discovery is also the only discovery system that can locate scientific content even for items with an unknown DOI or which do not have a DOI.
  2. Open access discovery tools locate freely available copies of research papers which might be behind the paywall K[.*]io