Tracking citations to research software via persistent identifiers is difficult because citations are diluted over the many PIDs assigned to a software package. On top of this, software citations are routinely edited out by actors across the scholarly communication process, such as reference managers, publishers, professors and discovery systems. The survival rate of a software citation in the current scholarly ecosystem is therefore extremely low. The Sloan-funded Asclepias project is a collaboration between a publisher, a discovery system and a repository, with the goal of promoting scientific software into an identifiable, citable and preservable object. We have built a citation broker that currently tracks some 6,000 citations to Zenodo DOIs from NASA ADS, CrossRef and Europe PMC.
5. The Asclepias project
• Brokering and harvesting scholarly links
• Open citation data in ADS, Crossref/DataCite Event Data and Europe PMC.
• ~6,000 citations to Zenodo records (January 2019)
6. Roles of the researcher
• Author of scholarly manuscripts ➞ credit
• Developer of scientific software
8. Key systemic issues
[Diagram: developer writes software, repository publishes it, discovery service tracks citations]
1. Information loss
2. Dilution of citations
9. Information loss: Author
• What do I cite? Paper, software, or software version?
• Citation recommendations
  • Do they exist? Are they correct? How quickly do they appear (latency)?
• Reference manager (e.g. BibTeX, EndNote, …)
  • A "software" entry type doesn't exist.
  • No version field support in BibTeX.
• Persistent identifier for software: zero, one or more?
Include citation in paper
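The BibTeX gap above is commonly worked around (this example is ours, not from the slides) by abusing the generic @misc entry type; the author, title and DOI below are hypothetical:

```bibtex
% BibTeX has no dedicated @software entry type, so software is often
% squeezed into @misc, with the version hidden in a free-text note.
@misc{exampletool_1_2,
  author       = {Doe, Jane},
  title        = {ExampleTool},
  year         = {2019},
  note         = {Version 1.2},
  doi          = {10.5281/zenodo.0000000},
  howpublished = {\url{https://doi.org/10.5281/zenodo.0000000}}
}
```

Whether the `doi` field survives into the rendered reference depends on the bibliography style, which is exactly the kind of latency and loss the slide describes.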
12. Information loss: Publisher
• Policy prohibits software citation.
• Journal authoring system defects:
  • Information from BibTeX is lost
  • CrossRef DOI ➞ JATS XML ➞ PDF
• Copy editors need training
• Journal ➞ scientific society ➞ publisher ➞ vendor platform ➞ outsourcing
Include citation in products
13. Information loss: Metadata quality
• Example (citing arXiv identifiers):
  • yymm.nnnnv1 (published 2012)
  • yymm.nnnnv7 (published 2017)
  • A paper from 2015 cites "yymm.nnnn"
  • Result: the 2015 paper appears to cite the 2017 software, because the metadata doesn't say it was the 2012 version.
15. Information loss: Discovery service
• Paper ingest workflow:
  • 1) Identify link, 2) create/update local record, 3) attribute citation link.
• Policy prohibits software records.
• Ingestion workflow incapable of identifying software.
• Non-trivial to identify the local record.
Ingest paper and track citations
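The three-step ingest workflow above can be sketched as follows; this is our illustration, not any discovery system's actual implementation, and the paper identifier, DOIs and record structures are all hypothetical:

```python
# Minimal sketch of the three-step paper ingest workflow:
# 1) identify software links, 2) create/update local records,
# 3) attribute citation links. Real systems would consult DataCite
# metadata rather than test a DOI prefix.

def ingest_paper(paper, records, citations):
    # 1) Identify links: references that look like software PIDs.
    software_links = [ref for ref in paper["references"]
                      if ref.get("doi", "").startswith("10.5281/zenodo.")]
    for link in software_links:
        doi = link["doi"]
        # 2) Create or update the local record for the cited software.
        record = records.setdefault(doi, {"doi": doi, "citations": 0})
        # 3) Attribute the citation link to that record.
        record["citations"] += 1
        citations.append((paper["id"], doi))

records, citations = {}, []
paper = {"id": "paper-001",
         "references": [{"doi": "10.5281/zenodo.1234567"},
                        {"doi": "10.1051/0004-6361/example"}]}
ingest_paper(paper, records, citations)
# Only the Zenodo DOI is treated as a software link.
```

The hard parts the slide names — spotting that a reference is software at all, and matching it to an existing local record — are exactly the two lines simplified away here.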
16. Discovery service differences
• Europe PMC: 71 different publishers
  • Springer, F1000, PLOS, PeerJ, Pensoft, Frontiers
• Crossref: 57 different publishers
  • Springer, F1000, Pensoft, PeerJ, Wiley
• NASA ADS: 38 different publishers
  • arXiv, American Astronomical Society, Springer, IOP, Oxford University Press, Elsevier
19. Dilution of citations: Developer/Repository
• Persistent identifiers: software, software paper, discovery system identifier (i.e. zero, one or more PIDs)
• Dynamic authorship
• Software name changes
• Granularity: DOI per version, module, module version, …
Ensure software is citable
24. How can we expect researchers to change culture if we can't even track citations to software?
25. Generality
• Search/replace:
  • "Software" with "Data", … (except "Paper")
  • "Astronomy" with "Physics", …
• Same problems: information loss, dilution of citations, closed proprietary systems.
26. The "fix" of a chain-linked system
Systemic issues need a joint effort to be solved.
27. The "Fix": Publisher
• Software citation policy
• Authoring system: working with the vendor to produce correct DOI metadata and JATS XML (machine readability).
28. The "Fix": Discovery
• Ingestion workflow for software with DataCite DOIs, handling:
  • Synonymous PIDs
  • Version relationships
• BibTeX generation fixes
29. The "Fix": Repository
• DOI versioning:
  • Version relationships
  • Version number field
  • DataCite metadata
• Dynamic authorship
• BibTeX generation fixes
• GitHub integration
[Diagram: software "SW" with version DOIs v1.0 and v1.2]
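The version relationships above can be expressed through DataCite's relatedIdentifiers; here is a minimal sketch with hypothetical DOIs, using the IsVersionOf/HasVersion relation types from the DataCite vocabulary:

```python
# Sketch of DataCite-style DOI versioning: each version DOI points up
# at a concept DOI for the software, and the concept DOI points down
# at its versions. All DOIs are hypothetical.

concept_doi = "10.5281/zenodo.1000000"   # concept DOI for "SW"

version_record = {
    "doi": "10.5281/zenodo.1000002",     # DOI for v1.2
    "version": "1.2",                    # explicit version number field
    "relatedIdentifiers": [
        {"relationType": "IsVersionOf",
         "relatedIdentifier": concept_doi,
         "relatedIdentifierType": "DOI"},
    ],
}

concept_record = {
    "doi": concept_doi,
    "relatedIdentifiers": [
        {"relationType": "HasVersion",
         "relatedIdentifier": "10.5281/zenodo.1000001",  # v1.0
         "relatedIdentifierType": "DOI"},
        {"relationType": "HasVersion",
         "relatedIdentifier": version_record["doi"],     # v1.2
         "relatedIdentifierType": "DOI"},
    ],
}
```

With these links in place, a discovery service can walk from any version DOI to the group of all versions, which is what the roll-up in the next slide relies on.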
30. The "Challenge": Roll up citations for software
• Goal: proper credit for software
• Roll up citations for software using:
  • Synonymous PIDs (identify a resource)
  • Version relationships (identify groups of resources)
  • Citation relationships (links between groups of resources)
  • Expert curation (actions in individual systems)
• Information needed by all:
  • Discovery systems
  • Repositories
• Problem: how to share and exchange information about scholarly links.
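The roll-up itself can be sketched in a few lines: collapse synonymous PIDs to a canonical PID, collapse version PIDs to their concept PID, then count per concept. The identifiers and mappings below are hypothetical, not Asclepias data:

```python
# Roll-up sketch: three citations through three different PIDs all
# resolve to one software concept.
from collections import Counter

synonyms = {            # synonymous PID -> canonical PID
    "arXiv:yymm.nnnn": "10.5281/zenodo.1000001",
}
concept_of = {          # version PID -> concept (group) PID
    "10.5281/zenodo.1000001": "10.5281/zenodo.1000000",
    "10.5281/zenodo.1000002": "10.5281/zenodo.1000000",
}

def canonical(pid):
    pid = synonyms.get(pid, pid)      # resolve synonymous PIDs first
    return concept_of.get(pid, pid)   # then roll versions up to the concept

citations = [("paperA", "10.5281/zenodo.1000001"),
             ("paperB", "arXiv:yymm.nnnn"),          # same software, other PID
             ("paperC", "10.5281/zenodo.1000002")]   # a later version

rollup = Counter(canonical(pid) for _, pid in citations)
# All three citations roll up to the single concept DOI.
```

The two lookup tables are precisely the information the slide says must be shared between discovery systems and repositories: no single party holds both.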
31. Software citation today
• Primarily self-citation (~80% of citations)
  • Not necessarily bad (cf. the software citation principles)
• Citation count >5 for only ~2% of software
  • Mostly generic libraries (neural networks, stats, visualisation, …)
• Citation recommendations are in bad shape
  • Each recommendation has a unique story
33. Software citation today
• Software citation is in pretty bad shape, …but don't despair (it is still in its infancy)!
• Systemic issues can only be solved with joint efforts.
• The problems exposed also impact PID knowledge graphs in general.