When we look at the rapid growth of scientific databases on the Internet in the past decade, we tend to take the accessibility and provenance of the data for granted. As we see a future of increased database integration, the licensing of the data may be a hurdle that hampers progress and usability. We have formulated four rules for licensing data for open drug discovery, which we propose as a starting point for consideration by databases and for their ultimate adoption. This work could also be extended to the computational models derived from such data. We suggest that scientists in the future will need to consider data licensing before they embark upon re-using such content in databases they construct themselves.
Why open drug discovery needs four simple rules for licensing data and models
Perspective
Antony J. Williams1*, John Wilbanks2, Sean Ekins3
1 Royal Society of Chemistry, Wake Forest, North Carolina, United States of America
2 Consent to Research, Oakland, California, United States of America
3 Collaborations in Chemistry, Fuquay-Varina, North Carolina, United States of America
Citation: Williams AJ, Wilbanks J, Ekins S (2012) Why Open Drug Discovery Needs Four Simple Rules for Licensing Data and Models. PLoS Comput Biol 8(9): e1002706. doi:10.1371/journal.pcbi.1002706
Editor: Philip E. Bourne, University of California San Diego, United States of America
Published September 27, 2012
Copyright: © 2012 Williams et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors received no specific funding for this article.
Competing Interests: Sean Ekins consults for Collaborative Drug Discovery, Inc. and is on the Board of Directors of the Pistoia Alliance. Antony J. Williams is employed by The Royal Society of Chemistry, which hosts the ChemSpider database discussed in this article. John Wilbanks consults for and sits on the Board of Directors at Sage Bionetworks, which runs an open access database of genomic and health information.
* E-mail: tony27587@gmail.com
PLOS Computational Biology | www.ploscompbiol.org | September 2012 | Volume 8 | Issue 9 | e1002706

Introduction

Public online databases [1] supporting life sciences research have become valuable resources for researchers depending on data for use in cheminformatics, bioinformatics, systems biology, translational medicine, and drug repositioning efforts, to name just a few of the potential end user groups. Worldwide funding agencies (governments and not-for-profits) have invested in public domain chemistry platforms. In the United States these include PubChem [2], ChemIDPlus [3], and the Environmental Protection Agency's ACToR [4], while the United Kingdom has funded ChEMBL [5] and ChemSpider [6], among others, and new databases continue to appear annually [7]. We have argued recently that the quality of the data contained within many of these databases is suspect [8] and that scientists should consider issues of data quality [9] when using these resources. By assimilating various data sources together and meshing data on drugs, proteins, and diseases, these various databases and network and computational methods may be useful to accelerate drug discovery efforts. The development of related cheminformatics platforms or derived models without care given to data quality is a poor strategy for long-term science [10], as errors become perpetuated in additional databases. There is real evidence that the integration of large, heterogeneous sets of databases and other types of content is "unreasonably effective" at accelerating the conversion of data into knowledge [11]. This implies the need for technical and semantic work to bring together databases that were never designed for interoperability [12], which is in itself a significant task [13,14].

As we and others have argued previously, there is another dimension to interoperability beyond technical formats [12] and ontological agreement [15]: the complex interactions of database licenses and terms of use around intellectual property. Many of these online databases have either obscure or confused licensing terms [16], and even in those cases where data are freely available for download and reuse there are often no clear definitions. Many databases simply "cut and paste" prohibitive copyright schema from traditional websites, or fail to address download and reintegration entirely (ibid). Since copyright law requires explicit permissions in advance to make use of copyrighted works, it is certainly unsafe to assume data licensing rights for any database that does not explicitly grant them.

The availability of data for download and reuse is an important offering to the community, as these data may be used for the purpose of modeling to develop prediction tools [17]. In addition, data can be ingested into internal systems inside pharmaceutical companies to mesh with their existing private data [18]. Data can likewise be incorporated into the expanding Linked Open Data cloud or into freely available online databases, where they can be downloaded and used to enhance content and to establish links between data. The Open PHACTS project [19,20] utilizes a semantic web approach to integrate chemistry and biology data across a myriad of data sources, including ChEBI, ChEMBL, and DrugBank for chemistry, and UniProt, WikiPathways, and many others for biology. The chemical structure representations are obtained from ChemSpider, which has previously imported the chemical databases, standardized them according to its data model, and is making the data available as open data to the project. Many of the primary online databases already have multiple links to external systems. This linking may be achieved by using available database services to form transitory links by, for example, using a chemical representation such as an InChI [21] to probe an application programming interface, search for the compound, and generate the linking URL in real time. Commonly, however, the links are more permanent in nature and are generated by downloading data from the various data sources, depositing a subset of the data (generally the chemical compound and associated database identifier), and using the particular database URL structure to form permanent links. This act of downloading and depositing multiple data sources commonly mixes the various licenses, if licenses are even declared, which, in many cases, they are not.

In some ways, there are analogous difficulties in the exchange of computational models like quantitative structure activity relationship (QSAR) datasets [22]: while there are efforts to standardize how the data and models are stored, queried, and exchanged, there has been little consideration of the licenses required to make the sharing of open source models a reality [23]. Similarly, one could consider the creation of maps of disease and how they are shared and reused [24] in the same manner.

The potential legal fragility of knowledge products derived from online databases with poorly understood licensing for each of the databases is a real problem, and one that will only increase in severity over time. This realization is not novel; indeed, the chemical blogosphere has been host to many discussions regarding the need for clear data licensing definitions on chemistry-related data. Many scientists likely echo these comments, but we will provide some examples. In particular, Peter Murray-Rust [25] espouses the value of "open data" [26] to the scientific discovery process and encourages clear licensing of all chemistry data according to the Open Knowledge Definition (OKD) [27] and the Panton Principles [28].

Herein we provide an extensive background to the intellectual property around data and databases in the sciences involved in drug discovery, those of biology, chemistry, and related fields, as well as discussion of open data licensing, openness, and open license limitations (Text S1). More importantly, we provide a set of rules that practitioners might apply when making data or databases available via the Internet or mobile apps [29]. Our ultimate goal is to illuminate the legal fragility of the database ecosystem in the drug discovery sciences, and to initiate a conversation about creating best practices.

Simple Rules for Licensing "Open" Data

We suggest, based on our analysis of the current data situation (Text S1), that the ideal is to use strong default rules for openness. From a copyright and database rights perspective, the public domain gives the most clarity and should be the default setting for data deposit, although it may not always be achievable. Understanding this is vital, because it sets the bar at the right height. Justifications for additional controls should be subject to argument; one often finds those controls are unnecessary when the discussion is framed this way.

It is also important to avoid noncommercial or share-alike approaches whenever possible. These are attractive terms to many data providers, but create significant barriers to interoperability. Data under noncommercial terms might be unusable by researchers at a pharmaceutical company, even to run a simple web-based query. It is important to realize that data under a share-alike license from one entity is probably not combinable with data under a share-alike license from another entity (this lack of interoperability kept Creative Commons licensed images out of Wikipedia for years, and is not one we wish to introduce into the ecosystem again!).

Thus, we propose the following simple rules for developing data licensing approaches inside scientific projects.

1. Before you begin a database project, convene a meeting of all of the stakeholders. Expose all of the expectations of the group and decide if your goals are primarily scientific, commercial, or mixed. If mixed, take a stern look at the actual commercial potential of the project. Invite technology transfer offices to join you; they have greater experience in the realities of commercialization.

2. If your project is scientific in nature, and not commercial, explore the benefits of open licensing and drawbacks of enclosure. Go through the various definitions and find the most common ground possible, always placing the burden of proof on those who want more control and not less. This will create less "default enclosure" but allow for those increasingly rare situations in which "open" is not appropriate. Attempt to hew as closely as possible to the admittedly rigorous open definitions and standards, and do not write your own intellectual property licenses; instead, use existing and well deployed ones.

3. Develop simple explanations of your terms of use, and make them easy to find for users. Make sure that your licensing, expectations for attribution, terms of use, and more are linked in many ways to your data and database. Do not expect your users to read the legal text of your terms and conditions and licenses; instead, create simple summaries with linkages to the detailed text for users to access. Whenever possible, use metadata to indicate the licensing terms explicitly; the Creative Commons Rights Expression Language [30] is a good tool for this.

4. Don't ever lock up metadata. A significant swath of data will be incompatible with an open regime, whether it's to protect trade secrets or patient privacy. But the metadata that describes closed data, and how to access closed data, can be almost as valuable. If you can't make the data public domain, make the metadata public domain.

In general, these four simple rules should allow us to build a more stable data and model sharing ecosystem while we live with some uncertainties until the courts rule on where the line of property stops and starts. We can't wait for the certainty to emerge, but we also want our systems to work when the courts do finally rule on issues such as where data and metadata stop and start, where copyright attaches, how data rights really affect re-use, and what it means to move towards a "cloud world" where copies aren't made of data at all. Following these heuristics when providing and/or accepting data is an approach that creates at least the opportunity to be forward-compatible with the future development of technologies.

But it is also important to pay close attention to licensing sanitation as a data consumer and user. No matter how tempting it is, do not copy a batch of informally open, but formally closed, data, run a database integration, and release the new database as "open"; that hurts the community. Instead, look for the terms of use, ask if it is "open", post your enquiry, and only when you are certain, redistribute. We think databases funded by the government should at the very least be open, and if not this should be stated prominently.

Conclusions

Although most scientists are likely unaware of this at present, data licenses are going to become increasingly important in science in the future, especially as we see more scientists embracing open notebook science, open science, and open-access publishing, and funding bodies promoting the increased accessibility of the fruits of their funding. We are likely not too far from funding bodies mandating immediate release of all data and results produced by each of their grantees, which is something we would advocate as potentially disruptive in its own right (S. Ekins et al., unpublished data).

We can hence imagine a near future in which many scientists will blog some or all of their research results while data aggregators will in turn consume this content and repackage it for others [31]. The licensing of this and other data will need to be clear if we are to build on the shoulders of giants and not have to face legal battles that pit Davids versus Goliaths. Considering data licensing as a part of the "scientific process" is vital for its future usability, and we strongly encourage scientists to consider data licensing before they embark upon re-using such content in databases they construct themselves or in the course of their research.

The four simple rules we have formulated for licensing data for open drug discovery represent a proposed starting point for consideration by database producers. These licenses could equally be used by individual scientists on their blogs and other online environments or accounts in which they make their data and models available for others.

Supporting Information

Text S1. This consists of a discussion in three sections:
• Intellectual property rights in data: Copyright and Database Rights.
• Trends in legal certainty: Open Data Licensing.
• "Informal" Openness and Open License Limitations.
(PDF)
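
The licensing checks recommended in the rules above (never assume rights that are not declared, avoid noncommercial and share-alike terms, and keep attribution expectations visible) could be run as an automated step before several sources are merged into one database. The following is a minimal sketch only. The license identifiers, the four categories, and the names `Source`, `audit`, and `attribution_needed` are assumptions introduced for illustration and do not come from any of the databases discussed here. The rule that two different share-alike licenses are flagged is a deliberately conservative reading of the share-alike problem described in the text, not a legal determination.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical license groupings for this sketch; a real project would
# maintain its own reviewed list.
PUBLIC_DOMAIN = {"CC0-1.0", "PDDL-1.0"}
ATTRIBUTION = {"CC-BY-3.0", "ODC-By-1.0"}
SHARE_ALIKE = {"CC-BY-SA-3.0", "ODbL-1.0"}
NONCOMMERCIAL = {"CC-BY-NC-3.0", "CC-BY-NC-SA-3.0"}


@dataclass(frozen=True)
class Source:
    name: str
    license_id: Optional[str]  # None means the source declares no license


def audit(sources: List[Source]) -> List[str]:
    """Return human-readable flags; an empty list means nothing was flagged."""
    problems = []
    share_alike_seen = set()
    for src in sources:
        lic = src.license_id
        if lic is None:
            # Undeclared terms: do not assume reuse rights, ask the provider.
            problems.append(f"{src.name}: no declared license; ask before reuse")
        elif lic in NONCOMMERCIAL:
            problems.append(f"{src.name}: noncommercial terms ({lic})")
        elif lic in SHARE_ALIKE:
            share_alike_seen.add(lic)
        elif lic not in PUBLIC_DOMAIN | ATTRIBUTION:
            problems.append(f"{src.name}: unrecognized license ({lic})")
    if len(share_alike_seen) > 1:
        # Conservative assumption: different share-alike licenses do not mix.
        problems.append(
            "multiple share-alike licenses: " + ", ".join(sorted(share_alike_seen))
        )
    return problems


def attribution_needed(sources: List[Source]) -> List[str]:
    """List the sources whose terms expect attribution in the merged product."""
    return [s.name for s in sources if s.license_id in ATTRIBUTION | SHARE_ALIKE]


if __name__ == "__main__":
    candidates = [
        Source("database_a", "CC0-1.0"),
        Source("database_b", "CC-BY-SA-3.0"),
        Source("database_c", None),
    ]
    for flag in audit(candidates):
        print(flag)
    print("attribute:", attribution_needed(candidates))
```

A check of this kind only records what each source declares. Whether a flagged combination may actually be redistributed still has to be settled by reading the terms of use and asking the data provider.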
References
1. Williams AJ, Tkachenko V, Lipinski C, Tropsha A, Ekins S (2009) Free online resources enabling crowdsourced drug discovery. Drug Discovery World 10, Winter: 33–38.
2. National Center for Biotechnology Information (n.d.) The PubChem database. Available: http://pubchem.ncbi.nlm.nih.gov/. Accessed August 2012.
3. US National Library of Medicine (n.d.) ChemIDPlus Advanced. Available: http://chem.sis.nlm.nih.gov/chemidplus/. Accessed August 2012.
4. Judson R, Richard A, Dix D, Houck K, Elloumi F, et al. (2008) ACToR–Aggregated Computational Toxicology Resource. Toxicol Appl Pharmacol 233: 7–13.
5. EMBL-EBI (n.d.) ChEMBL. Available: http://www.ebi.ac.uk/chembldb/index.php. Accessed August 2012.
6. Pence H, Williams AJ (2010) ChemSpider: an online chemical information resource. J Chem Educ 87: 1123–1124.
7. Galperin MY, Cochrane GR (2011) The 2011 Nucleic Acids Research Database issue and the online Molecular Biology Database Collection. Nucleic Acids Res 39: D1–D6.
8. Williams AJ, Ekins S, Tkachenko V (2012) Towards a gold standard: regarding quality in public domain chemistry databases and approaches to improving the situation. Drug Discov Today 17: 685–701.
9. Williams AJ, Ekins S (2011) A quality alert and call for improved curation of public chemistry databases. Drug Discov Today 16: 747–750.
10. Fourches D, Muratov E, Tropsha A (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50: 1189–1204.
11. Halevy A, Norvig P, Pereira F (2009) The unreasonable effectiveness of data. Intelligent Systems 24: 8–12.
12. Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, et al. (2012) Toward interoperable bioscience data. Nat Genet 44: 121–126.
13. NeuroCommons (n.d.) NeuroCommons project. Available: http://neurocommons.org. Accessed August 2012.
14. Ruttenberg A, Rees JA, Samwald M, Marshall MS (2009) Life sciences on the Semantic Web: the Neurocommons and beyond. Brief Bioinform 10: 193–204.
15. Hastings J, Chepelev L, Willighagen E, Adams N, Steinbeck C, et al. (2011) The chemical information ontology: provenance and disambiguation for chemical data on the biological semantic web. PLoS ONE 6: e25513. doi:10.1371/journal.pone.0025513
16. de Rosnay MD (2008) Check your data freedom: a taxonomy to assess life science database openness. Nature Precedings. Available: http://dx.doi.org/10.1038/npre.2008.2083.1. Accessed August 2012.
17. Ekins S, Williams AJ (2010) Precompetitive preclinical ADME/Tox Data: set it free on the web to facilitate computational model building to assist drug development. Lab on a Chip 10: 13–22.
18. Zhu Q, Lajiness MS, Ding Y, Wild DJ (2010) WENDI: a tool for finding non-obvious relationships between compounds and biological properties, genes, diseases and scholarly publications. J Cheminform 2: 6.
19. Azzaoui K, Jacoby E, Senger S, Rodríguez EC, Loza M, et al. (2012) Analysis of the scientific competency questions followed by the IMI OpenPHACTS consortium for the development of the semantic web-based molecular information system OPS. Drug Discov Today. In press.
20. Williams AJ, Harland L, Groth P, Pettifer S, Chichester C, et al. (2012) Open PHACTS: semantic interoperability for drug discovery. Drug Discov Today. In press. Available: http://dx.doi.org/10.1016/j.drudis.2012.05.016. Accessed August 2012.
21. Wikipedia (n.d.) InChIKey on the InChI Wikipedia page. Available: http://en.wikipedia.org/wiki/International_Chemical_Identifier#InChIKey. Accessed August 2012.
22. Spjuth O, Willighagen EL, Guha R, Eklund M, Wikberg JE (2010) Towards interoperable and reproducible QSAR analyses: exchange of datasets. J Cheminform 2: 5.
23. Gupta RR, Gifford EM, Liston T, Waller CL, Bunin B, et al. (2010) Using open source computational tools for predicting human metabolic stability and additional ADME/TOX properties. Drug Metab Dispos 38: 2083–2090.
24. Derry JM, Mangravite LM, Suver C, Furia MD, Henderson D, et al. (2012) Developing predictive molecular maps of human disease through community-based modeling. Nat Genet 44: 127–130.
25. Murray-Rust P (n.d.) Dr Peter Murray-Rust. Available: http://www.ch.cam.ac.uk/person/pm286. Accessed August 2012.
26. Wikipedia (n.d.) Open data. Available: http://en.wikipedia.org/wiki/Open_data. Accessed August 2012.
27. Open Knowledge Foundation (n.d.) Open data licensing. Available: http://wiki.okfn.org/Open_Data_Licensing. Accessed August 2012.
28. Murray-Rust P, Neylon C, Pollock R, Wilbanks J, Open Knowledge Foundation Working Group on Open Data in Science (2010) The Panton principles. Available: http://pantonprinciples.org/. Accessed August 2012.
29. Williams AJ, Ekins S, Clark AM, Jack JJ, Apodaca RL (2011) Mobile apps for chemistry in the world of drug discovery. Drug Discov Today 16: 928–939.
30. Creative Commons (n.d.) ccREL: Creative Commons rights expression language. Available: http://www.w3.org/Submission/ccREL/. Accessed August 2012.
31. Ekins S, Clark AM, Williams AJ (2012) Open drug discovery teams: a chemistry mobile app for collaboration. Molecular Informatics. In press. doi:10.1002/minf.201200034.