CrossRef provides a text and data mining hub for researchers. It has built a cross-publisher API to allow researchers to access full text content from participating publishers for open access or subscribed content using a common protocol. The API addresses issues like negotiating permissions by including licensing information in article metadata and a registry of text and data mining terms and conditions. Over 14 million articles from publishers now include full-text links and license information to enable text and data mining through the CrossRef API.
2. Not-for-profit association of scholarly publishers
All subjects, all business models
5,000+ organizations from all over the world
83 non-publisher affiliates, 2000 library affiliates
72 million + DOIs assigned to content items
4. User clicks on CrossRef DOI reference link in Journal A
Tani, N., N. Tomaru, M. Araki, AND K. Ohba. 1996. Genetic diversity and
differentiation in populations of Japanese stone pine (Pinus pumila) in
Japan. Canadian Journal of Forest Research 26: 1454–1462. [CrossRef]
DOI directory returns URL
User accesses cited article in Journal B
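The linking flow on this slide can be sketched as a lookup table that sits between the citing article and the cited one. This is a toy, in-memory illustration only: the class name, the DOI and the URLs are all hypothetical, and the real DOI directory is a network service rather than a dictionary.

```python
class DoiDirectory:
    """Toy in-memory stand-in for the DOI directory (illustrative only)."""

    def __init__(self):
        self._locations = {}

    def register(self, doi, url):
        # DOIs are case-insensitive, so store them under a lowercased key.
        # Registering an existing DOI again overwrites its location.
        self._locations[doi.lower()] = url

    def resolve(self, doi):
        # Return the URL currently registered for this DOI.
        return self._locations[doi.lower()]


directory = DoiDirectory()
doi = "10.5555/example-article"  # hypothetical DOI
directory.register(doi, "https://journal-b.example/old/path")

# Journal A's reference list stores the DOI, never the URL.
link_in_journal_a = doi

# Journal B moves the article and updates the directory once.
directory.register(doi, "https://journal-b.example/new/path")

# The unchanged link in Journal A now resolves to the new location.
resolved = directory.resolve(link_in_journal_a)
print(resolved)
```

Because the citing article holds only the identifier, nothing in Journal A has to change when Journal B relocates its content.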
7. What is Text and Data Mining (TDM)?
Text Mining is an interdisciplinary field combining techniques
from linguistics, computer science and statistics to build
tools that can efficiently retrieve and extract information
from digital text.
http://blogs.plos.org/everyone/2013/04/17/announcing-the-plos-text-mining-collection/
It uses powerful computers to find links between drugs
and side effects, or genes and diseases, that are hidden
within the vast scientific literature. These are discoveries
that a person scouring through papers one by one may
never notice.
http://www.theguardian.com/science/2012/may/23/text-mining-research-tool-forbidden
8. Why?
• Researchers find it impractical to
negotiate multiple bilateral agreements
with hundreds of subscription-based
publishers in order to authorise TDM of
subscribed content.
• Subscription-based publishers find it
impractical to negotiate multiple bilateral
agreements with thousands of
researchers and institutions in order to
authorise TDM of subscribed content.
• All parties would benefit from support of
standard APIs and data representations in
order to enable TDM across both open
access and subscription-based publishers.
11. Access To Full Text
Problem: Researchers want to get full text
content from publishers’ sites for OA or
subscribed content.
Solution: Common API (protocol) for requesting
machine readable full text from many different
publishers
12. Negotiating Permissions
Problem: Researchers want to know whether text
and data mining is allowed, and if not, get
permission.
Solution: Licensing information embedded in article
metadata and a registry for supplemental text and
data mining terms and conditions (licenses).
13. Text and Data Mining Steps
• Define problem
• Identify potential corpus to mine
• Discovery (full text links)
• Identification of subset which can be
accessed (license information)
• Download identified corpus
• Text and data mine corpus
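The discovery and identification steps above can be sketched as a filter over candidate records. This is a minimal illustration, assuming each record carries a full-text link and a license URI; the `Record` type, the DOIs and the URLs are hypothetical and do not reflect the actual CrossRef metadata format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    doi: str
    fulltext_url: Optional[str]  # None if no full-text link was deposited
    license_url: Optional[str]   # None if no license URI was deposited


def select_minable(records, accepted_licenses):
    """Keep records that have a full-text link and an accepted license."""
    return [
        r for r in records
        if r.fulltext_url is not None and r.license_url in accepted_licenses
    ]


# Hypothetical candidate corpus.
CC_BY = "https://creativecommons.org/licenses/by/4.0/"
candidates = [
    Record("10.5555/a", "https://pub.example/a.xml", CC_BY),
    Record("10.5555/b", None, CC_BY),  # no full-text link
    Record("10.5555/c", "https://pub.example/c.xml",
           "https://pub.example/tdm-terms"),  # license not accepted yet
]

# Licenses the researcher has decided they can work under.
accepted = {CC_BY}

to_download = [r.doi for r in select_minable(candidates, accepted)]
print(to_download)
```

Only the subset that passes both checks would go on to the download step.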
15. Publisher Participation
To enable their content for use by the service, publishers have
to provide CrossRef with two additional pieces of metadata:
• Full text URIs (to show where the full text is located)
• License URIs (to show the Terms & Conditions under
which researchers can use it)
Publishers can also implement rate limiting.
CrossRef doesn’t charge publishers for participating in this
service.
16. Researcher Use
• The CrossRef REST API is the main aspect of this service
• It is designed to allow researchers to easily harvest full text
documents from all participating publishers regardless of their
business model (e.g. open access, subscription).
• It makes use of CrossRef DOI content negotiation to provide
researchers with links to the full text of content located on the
publisher’s site.
• The publisher remains responsible for actually delivering the full
text of the content requested
• CrossRef does not charge researchers for using the service
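A researcher-side sketch of the content negotiation step might look like the following. Several details are assumptions that the slides do not state: the `Accept` media type, the idea that full-text links come back in an HTTP `Link` header, and the `rel` value used to recognise them. The sample header and URLs are hypothetical, and the request is built but never sent.

```python
import re
import urllib.request

# Assumed values, not given in the slides.
ACCEPT = "application/vnd.crossref.unixsd+xml"
FULLTEXT_REL = "http://id.crossref.org/schema/fulltext"
LICENSE_REL = "http://id.crossref.org/schema/license"


def build_request(doi):
    """Build (but do not send) a content-negotiation request for a DOI."""
    return urllib.request.Request(
        "https://doi.org/" + doi, headers={"Accept": ACCEPT}
    )


def fulltext_links(link_header):
    """Extract (url, media type) pairs for full-text links from a Link header."""
    links = []
    # Each entry looks like: <url>; key="value"; key="value"
    for url, params in re.findall(r"<([^>]*)>([^<]*)", link_header):
        attrs = dict(re.findall(r'(\w+)="([^"]*)"', params))
        if attrs.get("rel") == FULLTEXT_REL:
            links.append((url, attrs.get("type")))
    return links


request = build_request("10.5555/example-article")  # hypothetical DOI

# Hypothetical Link header, shaped the way this sketch assumes.
sample = (
    f'<https://pub.example/a.xml>; rel="{FULLTEXT_REL}"; type="application/xml", '
    f'<https://pub.example/a.pdf>; rel="{FULLTEXT_REL}"; type="application/pdf", '
    f'<https://pub.example/terms>; rel="{LICENSE_REL}"'
)
found = fulltext_links(sample)
print(request.full_url, found)
```

The full text itself would then be fetched from the publisher's URL, since the publisher remains responsible for delivery.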
24. Researcher queries DOI using content negotiation (CN) + API token
Publisher verifies API token
If token verified AND access control allows, publisher returns full text
(frequency at publisher discretion)
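The publisher-side check on this slide can be modelled as a two-stage decision. This is a toy model under stated assumptions: the HTTP-style status codes, the token values and the entitlement table are hypothetical choices for illustration, not a description of any publisher's implementation.

```python
def publisher_decision(token, doi, known_tokens, entitlements):
    """Toy model of the publisher-side check; returns an HTTP-style status."""
    if token not in known_tokens:
        return 401  # token not verified
    if doi not in entitlements.get(token, set()):
        return 403  # token is valid, but access control refuses this item
    return 200  # both checks pass: the publisher returns the full text


# Hypothetical tokens and subscriptions.
known_tokens = {"token-alice", "token-bob"}
entitlements = {"token-alice": {"10.5555/a"}}

results = {
    "verified and entitled": publisher_decision(
        "token-alice", "10.5555/a", known_tokens, entitlements),
    "verified, not entitled": publisher_decision(
        "token-bob", "10.5555/a", known_tokens, entitlements),
    "unknown token": publisher_decision(
        "token-eve", "10.5555/a", known_tokens, entitlements),
}
print(results)
```

Rate limiting, which the slide leaves to the publisher's discretion, would be a further check alongside these two.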
25. Benefits
• Streamlines researcher access to distributed full text for
TDM
• Enables machine-to-machine, automated access for
recognized TDM (i.e. researchers won’t be locked out of
publisher sites)
• Enables article-level licensing info and easy mechanism
for supplemental T&Cs for text and data mining
(publishers discussing model license via STM)
30. How can researchers use the service?
• Modify TDM tools to make use of the API token
• Modify TDM tools to look for <lic_ref> elements
• Register with the click-through service and
accept/decline licenses (if applicable)
• Details at: http://tdmsupport.crossref.org/researchers/
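Looking for `<lic_ref>` elements, as suggested above, could be done along these lines. The XML record is a hypothetical sample built only to show the technique; the surrounding element names and the URLs are assumptions, and real records may use namespaces, which the sketch tolerates by comparing local names.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata record; only the <lic_ref> name comes from the slide.
SAMPLE = """<record>
  <doi>10.5555/example-article</doi>
  <lic_ref>https://creativecommons.org/licenses/by/4.0/</lic_ref>
  <lic_ref>https://pub.example/tdm-terms</lic_ref>
</record>"""


def license_refs(xml_text):
    """Return the license URIs found in <lic_ref> elements."""
    root = ET.fromstring(xml_text)
    refs = []
    for element in root.iter():
        # Strip any "{namespace}" prefix so namespaced records also match.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "lic_ref" and element.text:
            refs.append(element.text.strip())
    return refs


def not_yet_accepted(refs, accepted):
    """Licenses the researcher still has to review before mining."""
    return [ref for ref in refs if ref not in accepted]


refs = license_refs(SAMPLE)
pending = not_yet_accepted(
    refs, accepted={"https://creativecommons.org/licenses/by/4.0/"}
)
print(refs, pending)
```

Any license left in `pending` is one the researcher would need to accept or decline, for example through the click-through service.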
31. Why use the DOI?
Using the DOI as the basis for a common text and data mining
API provides several benefits. For example, the DOI provides:
• An easy way to de-duplicate documents that may be found on
several sites.
• Persistent provenance information.
• An easy way to document, share and compare corpora without
having to exchange the actual documents.
• A mechanism to ensure the reproducibility of TDM results using
the source documents.
• A mechanism to track the impact of updates, corrections,
retractions and withdrawals on corpora.
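De-duplicating by DOI, and sharing a corpus as a list of identifiers, can be sketched as follows. The document records, DOIs and site names are hypothetical, and the list of prefixes stripped by `normalize_doi` is an assumption about how DOIs commonly appear rather than an exhaustive rule.

```python
def normalize_doi(value):
    """Reduce common DOI spellings to one comparable form."""
    doi = value.strip().lower()  # DOIs are case-insensitive
    prefixes = (
        "https://doi.org/", "http://doi.org/",
        "https://dx.doi.org/", "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi


def dedupe_by_doi(documents):
    """Keep the first document seen for each DOI."""
    seen = {}
    for doc in documents:
        seen.setdefault(normalize_doi(doc["doi"]), doc)
    return list(seen.values())


# The same article found on two sites in two formats, plus a second article.
documents = [
    {"doi": "10.5555/ABC", "source": "publisher site", "format": "pdf"},
    {"doi": "https://doi.org/10.5555/abc", "source": "repository", "format": "html"},
    {"doi": "doi:10.5555/xyz", "source": "publisher site", "format": "xml"},
]

unique = dedupe_by_doi(documents)

# A corpus can be documented and shared as a sorted list of DOIs,
# without exchanging the documents themselves.
manifest = sorted(normalize_doi(doc["doi"]) for doc in unique)
print(manifest)
```

A hash of the file contents would treat the PDF and HTML copies as different documents, whereas the DOI key treats them as one.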
Editor's Notes
Questions at end. Talk a little bit about what CrossRef is then move on to talk about our text and data mining service.
First just a few words about CrossRef for anyone who isn’t a member or might not be familiar with us as an organisation. CrossRef is a not-for-profit membership organisation of international scholarly publishers. We have 4000 member publishers, representing all disciplines - not just STM, and comprising commercial publishers, academic societies, open access publishers, university presses. We also have 83 affiliate members and 2000 library affiliates - these libraries and other organisations make use of the CrossRef database to look up DOIs and metadata. We are the largest DOI registration agency and have assigned nearly 63 million DOIs to date.
Publishers were finding that web sites changed, content moved, and links that they had put into their articles stopped working.
So they started a multi-publisher initiative to solve this problem of broken links. This is done using the DOI - the Digital Object Identifier, which I’m sure many of you are familiar with. A CrossRef DOI is simply a unique identifier for a piece of content. Once assigned, it doesn’t change. It is to all intents and purposes a meaningless number, but it allows that piece of content to be located on the web.
And it works like this: publishers use CrossRef DOIs to link to content, usually from the references at the end of articles. Users click on those DOI-based links and are referred via the CrossRef database to the cited article at its correct location on the web. If content moves, the publisher only has to update the CrossRef database once, and all of the publishers that are linking to their content using CrossRef DOIs will be redirected to the content in its new location.
Every month there are around 90 million clicks on CrossRef DOI links, so 100 million citations resolved to content.
The issue of Text and Data Mining has become very important and CrossRef is in a unique position to expand its current infrastructure (a registry of unique identifiers and metadata for scholarly content and thousands of members) to make TDM easier for researchers and their institutions and publishers.
Technical solution - we aren’t addressing the issue of licensing. CrossRef services are based around collaboration – achieving things across the industry that it wouldn’t make sense for each publisher to implement individually.
Why did CrossRef develop this service? Applies to OA content too. Let’s just illustrate these issues.
Bilateral agreements aspect - In the past, researchers who wished to text and data mine published literature had no common or simple way of accessing the full text for the content they wished to mine. This was true both of subscription-based content and of open access content. Consequently, TDM users accessed the content in one of two ways:
Negotiating with publishers to have the content delivered to them, either via physical media or bulk data transfer (e.g. FTP)
“Screen-scraping” the publisher’s website.
The first option doesn’t scale well across multiple Publishers and Researchers. It also presents synchronisation problems if the researchers want an ongoing feed of refreshed content.
The issue with the second option is that “screen scraping” is an inefficient, fragile and error prone mechanism for identifying and downloading full text. Screen scrapers put a large performance burden on web sites and, at the same time, any slight changes to the web site can break the tool that is doing the screen scraping.
CrossRef Text and Data Mining provides a common solution which works across Open Access and subscription-based publishers and is free for anyone to use.
Application programming interface. Protocol for requesting the information.
Needs publishers to deposit full text links
And links to license information
CrossRef service trying to deal with these three steps.
Discovery of where the full text is located, finding out if you have permission to mine it, and then pulling back that corpus of content in order to work on it.
This needs to be added to the publisher XML – license information at the article-level. Examples on our support site.
Publishers who require researchers to agree to a specific set of Terms and Conditions (T&Cs) before they are allowed to text and data mine content that they otherwise have access to (e.g. through an existing subscription) will need to make use of the click-through service. The click-through service is a registry for supplemental text and data mining terms and conditions (licenses).
So to put it all together…
Working group, which will migrate to a full CrossRef Committee when the service is officially launched. We have seen over 100,000 deposits of full text links and license information, mainly from Hindawi, Elsevier & KAMJE.
Eric Lease Morgan
Support site with info. Info on rate limiting on there too.
Publishers and researchers in pilot.
Launch in May
Rate limiting too
Processing the same document on multiple sites could easily skew text and data mining results, and traditional techniques for eliminating duplicates (e.g. hashes) will not work reliably if the document in question exists in several representations (e.g. PDF, HTML, ePub) and/or versions (e.g. accepted manuscript, version of record).
Using the DOI as a key will allow researchers to retrieve and verify the provenance of the items in the TDM corpus, many years into the future when traditional HTTP URLs will have already broken.