BioMed Central is a large open access publisher that is committed to open data initiatives. They have implemented several solutions to promote open data practices, including data journals, an open data award, and enabling data citation. They also work to integrate data hosting and deposition, address data licensing issues, and provide guidance on best practices. Future goals include adding more value to text and data mining applications and building business models around open data.
BioMed Central's open data initiatives
1. BioMed Central’s open data
initiatives
Alliance for Permanent Access conference
7th November 2012
Iain Hrynaszkiewicz
Publisher (Open Science), BioMed Central
iain.hrynaszkiewicz@biomedcentral.com
@iainh_z
2. About BioMed Central
• Launched in 2000, largest global publisher of peer-
reviewed open access journals (>240)
• >136,000 peer-reviewed open access articles published
• Part of Springer Science+Business Media since 2008
• Publish using Creative Commons (CC-BY) licenses
• Non-journal products include ISRCTN database
• Interested in innovation and recognise the growing need
for data sharing and publication
http://blogs.biomedcentral.com/bmcblog/tag/Open-Data/
3. BioMed Central and open data
• Increasing transparency in scientific research and
scholarly communication is at the core of strategy
• Data are an increasingly integral part of scholarly
communication, with many opportunities for increasing
the pace of knowledge discovery
• Publishers, particularly open access publishers, are well-
placed to share information across domain boundaries
http://www.biomedcentral.com/about/access
“By ‘open data’ BioMed Central means that these data are freely available on the public
internet permitting any user to download, copy, analyse, re-process, pass them to
software or use them for any other purpose without financial, legal, or technical
barriers other than those inseparable from gaining access to the internet itself. BioMed
Central encourages the use of fully open formats wherever possible.”
4. BioMed Central open data initiatives
• Data journals and article types
• Open Data Award
• Data hosting, citation, deposition and linking
• Lab notebook-journal integration (LabArchives)
• Data licensing
• Guidance and best practice e.g. human subjects –
confidentiality and consent
• Data formats and standards – efficient reuse
• Facilitation of data/text mining research
5. Problem: Lack of credit/recognition for
data sharing and publication
• In science credit is everything but incentives for data
publication are still emerging
• Datasets are not generally as discoverable and
citable as journal articles – yet
• Requirements for data sharing are field/location-
specific
• Need more empirical evidence of the benefits of data
publication for individual scientists
6. Solution #1: Journals and article types
enabling data publication
Data notes: “[B]riefly describe a biomedical data
set or database, with the data being readily
accessible and attributed to a source”
http://bit.ly/y3Jb3b
Research: E.g. The International Stroke Trial
database
http://www.trialsjournal.com/content/12/1/101
Data notes: “[E]xceptional datasets deposited
in our GigaScience repository that have been
selected for further peer review”
http://bit.ly/yPBsAA
7. Solution #2: Open Data Award
“We ... recognize
researchers who
have ... have
demonstrated
leadership in the
sharing,
standardization,
publication, or re-use of
biomedical research
data.”
http://www.biomedcentral.com/researchawards/opendata
8. Solution #3: Enable and
encourage/require data citation
“References
...
Only articles, datasets and abstracts that have been published or
are in press, or are available through public e-print/preprint servers,
may be cited
…
“Dataset with persistent identifier
Zheng, L-Y; Guo, X-S; He, B; Sun, L-J; Peng, Y; Dong, S-S; Liu, T-
F; Jiang, S; Ramachandran, S; Liu, C-M; Jing, H-C (2011): Genome
data from sweet and grain sorghum (Sorghum bicolor).
GigaScience. http://dx.doi.org/10.5524/100012."
http://blogs.biomedcentral.com/bmcblog/2012/01/19/citing-and-linking-dat
9. Problem: Where can data be stored –
permanently?
• Publishers not best placed to run repositories for long
term preservation of large datasets
• Mirrors of publisher content not able to accept
arbitrary amounts of additional data
• Many data repositories exist but most are
domain/location specific and there are many different
types of funding model, license agreement and
persistent identifiers in use
11. Editor-in-Chief: Laurie Goodman, BGI (USA)
Editor: Scott Edmunds, BGI (China)
Assistant Editor: Alexandra Basford, BGI (China)
GigaScience publishes ‘big-
data’ studies from the entire
spectrum of life sciences
Benefits
• Novel publishing format -
manuscript publication and
data hosting
• Assignment of data DOIs
allows separate data citation
• The BGI is covering all APCs
for the first year after
launch
www.gigasciencejournal.com
www.biomedcentral.c
13. GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological
and biomedical research as it enters the era of “big-data”…
15. Anatomy of a GigaScience Publication
Idea
Study
Metadata
Data
Analysis
Answer
16. Solution #2: Comprehensive author
information on available data repositories
http://datacite.org/repolist
http://www.biomedcentral.com/about/su
17. Solution #3: Research on repositories
http://publicationethics.org/files/u661/EthicalEditing_Autumn2012_final.pdf
We are looking for
repositories with interests
in clinical research data –
can you help?
18. Problem: Data are not consistently
linked to publications
• Data deposition policies are not established in all
fields
• Even where they are links/accession numbers tend to
be inconsistently presented and rarely cited
• Researchers may, independently of journal
requirements, deposit data in repositories
• A missed opportunity to enhance the literature
19. Solution #1: ‘Availability of supporting
data’ article section
• A tool to put data deposition policies – encouraged or
mandated – into practice
• Provides links in a consistent place within an article to
supporting data, regardless of the location or format
of the data
• Data must be permanently available (DOI or
equivalent)
• ~50 journals including GigaScience, BMC series
http://www.biomedcentral.com/about/supportingdata
20. Availability of supporting data
BMC Res Notes 2012, 5:21 http://www.biomedcentral.com/1756-0500/5/21/
GigaScience 2012, 1:3 http://www.gigasciencejournal.com/content/1/1/3
21. Solution #3: Lab notebook integration
• BMC authors entitled to LabArchives’
(http://www.labarchives.com/bmc) online lab notebook
with 100 MB of free storage
• Features include:
- Data publishing with DOI assignment
- Citable, linkable data supporting publications
- Reusable/integrate-able data with CC0 waiver
- Integrated manuscript submission to BMC journals
- Additional free storage (standard is 25 MB)
http://blogs.openaccesscentral.com/blogs/bmcblog/entry/labarchives_and_biomed_central_a
23. 24 Oct 2012
Open data
partnership leads to
release of data
from Nobel Prize-
winning laboratory
for public use
http://www.biomedcentral.com/presscenter/pressreleases/20121024c
24. Problem: Licensing that restricts data
integration and (re)use efficiently
http://pantonprinciples.org/
“[P]eople mis-use copyright licenses on
uncopyrightable materials and data sets: the
confusion of the legal right of attribution in
copyright with the academic and professional
norm of citation of one's efforts. ” John
Wilbanks, VP, Science, Creative Commons,
http://bit.ly/djl5Fa August 11, 2010
“...any restrictions on use should be strongly
resisted and we endorse explicit encouragement
of open sharing.” Schofield et al.: Post-publication
sharing of data and tools. Nature 2009, 461:171.
“The data should be released in standardized
formats without intellectual property constraints. ”
Conway PH, VanLare JM: Improving Access to
Health Care Data: The Open Government
Strategy. JAMA 2010;304(9):1007-1008.
http://www.isitopendata.org/
25. Why Creative Commons CC0?
• interoperability: CC0 is human and machine-
readable
• universality: CC0 is global and universal and
widely recognized
• simplicity: no need for humans to make, and
respond to, individual data requests – avoids
“attribution stacking” with CC-BY licenses
Schaeffer P: Why does Dryad use CC0?
http://blog.datadryad.org/2011/10/05/why-does-dryad-use-cc0/
http://creativecommons.org/publicdomain/zero/1.0/
27. Public consultation on
implementing CC0 for
data published in open
access journals: closes
10th November 2012
http://blogs.biomedcentral.com/bmcblog/2012/09/10/put-the-open-in-open-data/
Hrynaszkiewicz I, Cockerill MJ:
Open by default: a proposed
copyright license and waiver
agreement for open access
research and data in peer-
reviewed journals. BMC Research
Notes 2012, 5:494
http://www.biomedcentral.com/1756-0500/5/494
28. Implementing CC0 in journals – how?
• Specify a date from which the new license would
apply to data (CC-BY remains for other content)
• Only applies to data submitted to the journal
• Some relatively minor technical and operational
implications
• Cultural change may be the biggest challenge
• Consultation is identifying common concerns, FAQs,
and further definitions and use cases for open data in
journal publications
Hrynaszkiewicz I, Cockerill MJ: Open by default: a proposed copyright
license and waiver agreement for open access research and data in
peer-reviewed journals. BMC Research Notes 2012, 5:494
http://www.biomedcentral.com/1756-0500/5/494
29. Problem: Lack of guidance, exemplars,
incentives to make data reusable
• Sharing/publishing detailed human subjects data, in
the absence of explicit consent, can potentially
infringe privacy (ethically and legally)
• Data are more (re)usable if published in community
endorsed, standard formats
• Standards and appropriate guidance do not yet exist
in all domains
• Few incentives to follow data standards
30. Solution #1: Work with journal editors
to produce guidance where it is needed
BMJ 2010;340:c181
Co-published in:
Trials 2010, 11:9
33. Solution #3: Incentivize, promote and
share best practice and standards
http://www.biomedcentral.com/bmcresnotes/series/datasharing http://biosharing.org/standards_view
34. Problem: Adding value to data of use to
researchers, readers and publishers
• Text/data mining applications often are research
project or research specific and not always attractive
to commercial publishing platforms and their
customers
• Value to the non-expert can be limited
• Makes business model/case challenging for
publishers
We publish under the Creative Commons Attribution license (CC-BY): authors/copyright owners irrevocably grant to anyone the right to use, reproduce or disseminate the research article in its entirety or in part, in perpetuity. An article processing charge is typically levied for each accepted article. When we say open access we mean research which is free to build upon, distribute and use with the minimum of barriers (the Budapest definition). We launched our first data publishing journal, BMC Research Notes, in 2008.
In addition to service provision, as a successful open access publisher, increasing transparency in science communication is at the core of our strategy. And as a technology-driven company, we believe that data are and will become an increasingly integral part of the published scientific record. Moreover, publishers who serve a broad section of the scientific community are well placed to share information across domain boundaries, i.e. between scientists who are more likely to share data and those who are not. Our policy on open data is analogous to our policy on open access to papers, although as we will see the appropriate legal tools (licenses for data) are not the same.
Current projects include but are not limited to.... In other words BMC’s initiatives have been about removing barriers to data sharing and publication, and solving problems. I’m not going to dwell too much on why we should preserve, share and publish research data. Data publication has moved on from the Why? to the How?
So what are the problems we’re trying to solve? I imagine we all know that lack of credit for data sharing is a major barrier, and that mandates and incentives are often field-specific. Data sharing only happens routinely in some scientific fields (e.g. physics and genomics have led the way). The benefits of data sharing for science as a whole, the economy and, in clinical research, patients are well documented, but evidence of the benefits to individual scientists and evidence of intelligent reuse are still emerging in some fields.
Data journals are no longer a new idea, and many publishers now have them. They are helpful for overcoming the credit problem because they make data publication equivalent to publication of research. Since it launched in 2000, BMC has always had a “reproducible research friendly” ethos, encouraging submission of supplementary materials (additional files). And for more than 5 years BMC has published journals that specifically aim to make raw data available, either as additional files (supplementary material) with research or as data-driven articles: data papers (we call them data notes). In the case of data notes, data are the primary purpose of publication. Data notes describe a biomedical dataset or database, with the data being readily attributable to a source. Furthermore, efficient online publication processes can facilitate dataset publication. Currently, only a fraction of experimental datasets make it into the literature, and many more datasets have the potential to be useful but do not warrant a traditional publication. There are different approaches to data publishing and a number of new journals/services are emerging (F1000, Datasets International). Datasets can still be included as supplementary material (virtually unlimited numbers of files, no larger than 20 MB per file).
• BMCRN: data can be additional files or elsewhere on the web
• GigaScience: data in the GigaScience repository, GigaDB
• Trials: data as additional files (so far)
Other solutions. Everyone likes getting prizes. Since 2010 we have had an open data category in our annual research awards, which aims to recognise researchers who have shown leadership in the sharing, standardization, publication or re-use of biomedical research data. The 2011 winner, awarded in 2012, is pictured here. More about the specifics of their dataset later.
The STM Association announced its endorsement of data citation in reference lists in June 2012. BMC has included data citation in the style of every journal it publishes since January 2012, and since July 2011 selected journals have encouraged or required data citation and linking consistently from published research.
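The “Dataset with persistent identifier” reference style shown on the data citation slide can be assembled mechanically from a handful of fields. The following is a minimal sketch in Python, not BioMed Central's own tooling: the `DatasetCitation` class and `format_citation` function are hypothetical names introduced here, and the author list is shortened for illustration.

```python
from dataclasses import dataclass


@dataclass
class DatasetCitation:
    """Fields of a dataset reference (hypothetical structure, not a BMC schema)."""
    authors: list[str]
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(c: DatasetCitation) -> str:
    """Render a reference in the 'Dataset with persistent identifier' style."""
    author_list = "; ".join(c.authors)
    # The dx.doi.org resolver form is used only to match the slide's example.
    return (
        f"{author_list} ({c.year}): {c.title}. "
        f"{c.publisher}. http://dx.doi.org/{c.doi}."
    )


sorghum = DatasetCitation(
    authors=["Zheng, L-Y", "Guo, X-S", "He, B"],  # shortened author list
    year=2011,
    title="Genome data from sweet and grain sorghum (Sorghum bicolor)",
    publisher="GigaScience",
    doi="10.5524/100012",
)
print(format_citation(sorghum))
```

Keeping the DOI as a separate field, rather than storing only the full link, means the same record can be rendered with whichever resolver address a journal style prefers.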
We often hear debate about publishing additional files (supplementary material). Publishers are not best placed to run repositories for long-term preservation of large datasets. Mirrors of publisher content are not able to accept arbitrary amounts of additional data. Long-term preservation presents a challenge with respect to continuity. BGI is capable of sequencing ~2000 genomes per day (6 TB/day ≈ 2 PB/year).
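The daily-to-yearly conversion quoted for BGI's output is easy to check. A small sketch, assuming decimal units (1 PB = 1000 TB) and a 365-day year, neither of which is stated in the talk:

```python
TB_PER_PB = 1000      # assumption: decimal (SI) units
DAYS_PER_YEAR = 365   # assumption: sequencing runs every day of the year


def yearly_output_pb(tb_per_day: float, days: int = DAYS_PER_YEAR) -> float:
    """Convert a daily data volume in terabytes to petabytes per year."""
    return tb_per_day * days / TB_PER_PB


# 6 TB/day is the figure quoted in the notes.
print(f"{yearly_output_pb(6):.2f} PB/year")
```

Under these assumptions the result is a little over 2 PB per year, consistent with the rounded figure in the notes.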
GigaScience aims to revolutionize data dissemination, organization and use. It publishes big data studies across the life sciences and includes a novel publication format which links manuscript publication with an extensive database that hosts all associated data and provides analysis tools and cloud-computing resources. It published its first articles in July 2012, including high-profile articles on, e.g., the Puerto Rican parrot genome.
But what if you want to publish everything: all your data and code (analysis tools)? You want a fully executable account of your research, allowing others to explore, validate and reproduce your findings. Well, in data intensive science we have a solution for that problem too.
GigaDB will soon look like this. The database accepts data and code for all research and data papers accepted for the GigaScience journal. It also accepts important datasets supporting publications in other journals, and has data from Genome Biology and Nature. It uses the Creative Commons CC0 waiver for all data, ensuring maximum potential for integration and reuse.
Data and code for one of the first research articles in GigaDB. About 80 GB of data.
Large and linked in scientific publishing: the launch of ‘big data’ journal GigaScience
BGI, the world’s largest genomics institute, and BioMed Central, a leader in scientific data sharing, aim to revolutionize science publishing with the launch of GigaScience, a new open access, open data journal with a scope that embraces all life science research that generates ‘big data’. This launch is a major first step towards the open access publication of complete, reproducible accounts of all parts of data-intensive scientific research projects. Together GigaScience and its integrated database Giga DB provide scientific analyses, full dataset hosting, and access to the software tools used to conduct these analyses, along with publication of more traditional scientific articles describing the studies. Having all these together finally allows readers to not only glean the scientific conclusions in the papers, but also to directly test these using the underlying data and analysis tools. In this way, GigaScience offers a way to help overcome the growing problem of the lack of reproducibility of research. GigaScience publications also include Digital Object Identifiers (DOIs) for all datasets in the journal database, Giga DB. This helps make datasets more permanent, as well as fully track-able, discoverable, linkable, and citable, which traditionally has only been possible for journal articles. Citation enables scientists, who generate these enormous datasets and share them with the community, to gain more appropriate credit for their contributions to research. Laurie Goodman, Editor-in-Chief, says, “The full use of large-scale data has sadly lagged far behind our ability to produce it. 
The leaders of BGI realized they had the ability, given their vast computational resources, to create an innovative new journal format — one where enormous datasets could be fully hosted and directly linked to their original scientific studies. By including analysis tools in a data platform, as well as the planned addition of cloud technology later this year, GigaScience can serve as a means to put such data into the hands of researchers who do not have the vast computational resources required for optimal data use. This is in keeping with the goals of our co-publisher BioMed Central, which makes them the perfect partner in achieving this endeavour.” Exemplifying GigaScience and Giga DB’s innovative approach to publishing, in the launch edition, is a research article from Stephan Beck’s group at the University College London, UK. This article focuses on ways to conduct whole-genome analyses of DNA methylation, an important mechanism that regulates gene expression. The article contains all of the supporting data and software tools needed to recreate the experiments — a total of 84 GB — freely available for download and reuse from Giga DB. Using BGI’s data storage capacity, GigaScience is able to host these and other files, which are far larger than any other journals are able to publish. GigaDB furthermore supports open data by giving up all copyright in published datasets by its use of the Creative Commons CC0 public domain dedication waiver. This allows anyone to access and reuse published data without restrictions. This is part of a forward thinking, technology-driven approach to science publishing. As Publisher Iain Hrynaszkiewicz says, “We traditionally have only had access to limited amounts of scientific knowledge – usually articles in journals summarizing experiments – which means we do not reap the full benefits of research. 
Through GigaScience’s open access, open data journal and database we are entering a new era of publishing, where large amounts of scientific data are as accessible, citable and interconnected as the literature which they support. BioMed Central are delighted to be leading this revolution in open data and science communication with GigaScience and BGI, which we hope will ultimately help make scientific research faster and more reliable.” As well as this innovative, big-data-driven publication format the journal also provides reviews and commentaries that address the many hurdles that still need to be surmounted to improve future big-data handling. BioMed Central and the GigaScience editors will be marking the journal’s launch at the ISMB conference 15-17 July 2012 (booth number 36).
Notes to Editors
BioMed Central (http://www.biomedcentral.com/) is an STM (Science, Technology and Medicine) publisher which has pioneered the open access publishing model. All peer-reviewed research articles published by BioMed Central are made immediately and freely accessible online, and are licensed to allow redistribution and reuse. BioMed Central is part of Springer Science+Business Media, a leading global publisher in the STM sector. BGI (formerly known as Beijing Genomics Institute) was founded in 1999 and has since become the largest genomic organization in the world. With a focus on research and applications in the healthcare, agriculture, conservation, and bio-energy fields, BGI has a proven track record of innovative, high profile research, which has generated over 178 publications in top-tier journals such as Nature and Science. BGI’s distinguished achievements have made a great contribution to the development of genomics in both China and the world. 
Their goal is to make leading-edge genomics highly accessible to the global research community by integrating industry’s best technology, economies of scale, and expert bioinformatics resources. BGI and its affiliates, BGI Americas and BGI Europe, have established partnerships and collaborations with leading academic and government research institutions, as well as global biotechnology and pharmaceutical companies. GigaScience (http://www.gigasciencejournal.com) is co-published by BGI, the world’s largest genomics institute, and BioMed Central, the world’s largest open-access publisher. The journal covers research that uses or produces ‘big data’ from the full spectrum of the life-sciences. It also serves as a forum for discussing the difficulties of and unique needs for handling large-scale data from all areas of the life sciences. The journal has a completely novel publication format — one that integrates manuscript publication with complete data hosting, and analyses tool incorporation. To encourage transparent reporting of scientific research as well as enable future access and analyses, it is a requirement of manuscript submission to GigaScience that all supporting data and source code be made available in the GigaScience database, Giga DB (http://gigadb.org), as well as in their publicly available repositories. GigaScience will provide users access to associated online tools and workflows, and will be integrating cloud resources into the database later this year, maximizing the potential utility and re-use of data. (Follow us on twitter @GigaScience; and keep up-to-date on our blogs http://blogs.openaccesscentral.com/blogs/gigablog/feed/entries/rss).
Consider the typical flow of scientific knowledge: we usually capture only the last two steps. The aim is to publish “executable” papers, capturing all outputs of a project and enabling truly reproducible research. This doesn’t replace community expectations to deposit data in, e.g., sequence databases, but GigaDB can co-host the data and include the analysis tools; the latter is not possible in, e.g., NCBI databases.
But of course few journals have this functionality. So we’ve been working with DataCite, the Digital Curation Centre and the British Library to provide a comprehensive list of >100 data repositories, which is linked from our instructions for authors. Also, if as a publisher we are encouraging data sharing and publication, it’s a good service to our authors to tell them where it can be done.
As well as passively identifying potentially useful repositories, we are trying to fill known “gaps in the market” through research. In medicine, there is no obvious/central/major repository. We recently received funding to work with researchers at Ottawa to address these gaps in knowledge and produce comprehensive information on the features and practices of existing repositories which have common interests, or potential interest, in sharing and public disclosure of clinical trials data. The methodology of this study includes reviewing existing resources that catalogue information on data repositories, such as Databib (http://databib.org/), literature review, analysis of repository websites, and engagement of relevant stakeholders, such as interviews with repository managers. We aim to capture the methods of existing repositories for public disclosure of clinical data, and non-public forms of data sharing, such as the unique and persistent identification systems for datasets; the license, use or other agreements employed by the repositories; and sustainability (business) models, to understand how they have addressed the issues and to summarize what are considered good practices. We also plan to gather information on how repositories define raw data and metadata; data formats and standards; methodology of data preparation; privacy; standards of quality control; policies and terms of data inclusion and access to data for re-use; system architecture; features that encourage data sharing across geographical and domain boundaries, such as networks of repositories (federated) vs. a centralized repository; and collaborations with other stakeholders in clinical research data including journals and publishers.
With all these repositories we know data archiving goes on in a number of disciplines/institutions but unless there is a community or journal/editorial mandate for data availability this will not be consistently checked as part of the peer review and publication process and documented in the article. If data are assigned persistent identifiers then they can potentially be linked to publications, and substantially enhance the reliability and reproducibility of the literature.
A number of journals require ‘data sharing statements’ from their authors (e.g. BMJ, Annals), which are a step in the right direction. But I’m more for data sharing than for data sharing statements. There are different approaches to data deposition policies. BMC offers a different approach, an article section, that enables consistent linking of datasets when they are permanently available online. In effect, it is a tool for editors and communities to put data deposition policies into practice. It makes clear to readers when they can access data as well as the paper.
Simple, standard statement that can be used for a variety of formats, and also when data are available in additional files.
Another way to link data to publications is to give authors their own personal repository, and the ability to permanently publish datasets. This is what LabArchives offers through its online lab notebook. So anyone can be a data publisher now. Integrated submission includes manuscript templates and links to the submission system.
Last month a data paper was published in BMC Research Notes which described a data set from a Nobel Prize winning lab.
Back to trying to solve problems. To gain the full benefit from scientific data it must be free to build upon and reuse with the minimum of barriers, which includes legal barriers. Removing them requires data to be placed in the public domain, and there have been loud calls for the right license or, in fact, no copyright license to be applied to research data. There are problems with copyright and data: some examples here are from Nature, JAMA, and prominent figures and initiatives in the open science/data community. The most permissive license under which journal content is typically published is CC-BY, even in open access journals.
A solution for making data maximally reusable in accordance with these open data principles is, again, from Creative Commons: the CC0 public domain dedication waiver. A waiver is a means to give up rights rather than assert them. For data this legal tool is attractive for a number of reasons, which have been cogently expressed by someone else (Peggy Schaeffer at the Dryad repository), so I’ll paraphrase here.
• interoperability: since CC0 is both human and machine-readable, other people and indexing services will automatically be able to determine the terms of use.
• universality: CC0 is a single mechanism that is both global and universal, covering all data and all countries, and can be applied in all jurisdictions. It is also widely recognized.
• simplicity: there is no need for humans to make, and respond to, individual data requests, and no need for click-through agreements. This allows more scientists to spend their time doing science. Basically, it’s much more efficient.
So what is the solution? Copyright and licensing aren’t often the topics at the forefront of scientists’ minds. How do you engage scientists and other stakeholders in research on licensing and promotion of reuse of data?
We made a series of announcements, including a draft position statement on open data, on the BMC blog. We invited contributions from the entire scientific community and held a series of working group meetings. This led to where we are today: development of a detailed proposal and protocol for implementing a variable/combined license agreement for OA journals, which places data in the public domain under CC0 and the remainder of papers under CC-BY. This went out for public consultation in September and is open until the end of this week.
So, how can a combined CC-BY and CC0 license agreement be implemented? On one hand it’s quite easy. The publisher just sets a date from which submitting authors must agree to the new license. You would need to address some relatively minor technical and operational tasks, such as what license information you encode in article XML and, if you publish it, RDF. Also, you need to have a robust system for handling opt-outs when the standard license is not feasible for a small number of authors. But on the other hand it’s quite complicated, as there may well be cultural objections to change. A way of addressing this is a public consultation, which was launched last week by BioMed Central, along with publication of a detailed white paper setting out all the issues and practical steps.
Publishing platform developments: implementing a new license agreement will have technical as well as policy and procedural implications.
• Tagging of articles and data files published under a non-standard license agreement (where authors have opted out of the new default open access-open data license)
• Editing standard embedded license information in article XML metadata and RDF, and a tool to automate insertion of non-standard licensing terms
• Insertion of license information into additional files and associated metadata
Furthermore, the following would be desirable to enhance the discoverability and usefulness of open data in journal articles:
• Tagging and classification of published data files, for example by file type
• A tool to automatically discover and aggregate additional files
• A tool to (retrospectively) associate data objects with papers on the web
• Approaches to associating published datasets with journal articles which go beyond hyperlinking, such as through linked data methods
• Searching within and filtering of additional files
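The date-based switch and the tagging of opt-outs described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not BioMed Central's implementation: `EFFECTIVE_DATE`, `LicenseTag` and `license_for` are hypothetical, the cut-over date is invented, and the choice that opted-out data falls back to CC-BY is an assumption (the notes only say such items carry a non-standard license). The CC0 address is the one given on the slides.

```python
from dataclasses import dataclass
from datetime import date

CC0_URL = "http://creativecommons.org/publicdomain/zero/1.0/"  # from the slides
CC_BY = "CC-BY"                     # license version deliberately unspecified
EFFECTIVE_DATE = date(2013, 1, 1)   # hypothetical cut-over date


@dataclass(frozen=True)
class LicenseTag:
    """License to embed in metadata, plus a flag for non-standard terms."""
    license: str
    non_standard: bool


def license_for(content_kind: str, submitted: date, opted_out: bool = False) -> LicenseTag:
    """Pick the license for one item of a submission.

    content_kind is "data" or anything else (treated as article content).
    """
    new_terms_apply = submitted >= EFFECTIVE_DATE
    if content_kind == "data" and new_terms_apply:
        if opted_out:
            # Assumption: opted-out data keeps CC-BY and is tagged for review.
            return LicenseTag(CC_BY, non_standard=True)
        return LicenseTag(CC0_URL, non_standard=False)
    # Article text, and anything submitted before the cut-over, stays CC-BY.
    return LicenseTag(CC_BY, non_standard=False)


print(license_for("data", date(2013, 6, 1)))
print(license_for("article", date(2013, 6, 1)))
print(license_for("data", date(2013, 6, 1), opted_out=True))
```

Returning a flag alongside the license keeps the opt-out visible to downstream steps, such as the metadata tagging listed above, instead of hiding it in the license string.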
There are various domain specific challenges in preparing data for publication, which as a publisher we come to understand through our interactions with different communities. In clinical research patient privacy must be protected, but at the same time does not need to always be an excuse for not sharing data. Data standards are important for efficient data reuse but there are many of them, and not always the right guidelines and incentives for authors to adhere to them.
From 2008 to 2010 we worked with some of the editors of Trials, a journal of clinical trials research, to help facilitate publication of raw data. We convened a working group of publishing, funding, research ethics and editorial stakeholders, and developed practical guidelines on preparing clinical data for publication while maintaining privacy, and on the alternatives when open access to clinical data is not possible. The guidance was published in the BMJ and in Trials in 2010.
This guidance was put into practice by the International Stroke Trial (IST) group, led by Peter Sandercock. They published a paper and dataset in the journal Trials with the primary purpose of making the individual patient data (IPD) from one of the largest trials in acute stroke ever conducted, covering some 19,000 patients, available for alternative analyses and reuse.
Interestingly, the whole dataset amounted to less than 5Mb. The data will be useful for a number of purposes, including teaching and planning future trials (the resources currently available in the developing world are fairly similar to the resources available for the IST in the 1990s).
BMC Research Notes partnered with BioSharing, a group based in Oxford, to work collaboratively on cataloguing data standards and promoting their use. Since 2008 BMC Research Notes has published data papers and encouraged the publication of software tools, databases and data sets. A key objective of the journal is to ensure that associated data files are, wherever possible, published in standardised, reusable formats, and to define appropriate recommendations for domain-specific data file standards. As an incentive, anyone developing a data standard for the life sciences can publish a paper describing it, together with an exemplary dataset, in BMC Research Notes, and BioMed Central is waiving the article processing charge.
BioMed Central has, since 2003 or 2004, made its full-text corpus of articles published under CC-BY available for bulk download for data mining research. This enables the scientific community, without special permission, to explore and build on the open access literature as a scientific resource.
And we’ve always had our own visions of adding value to the open access literature. Our database of medical case reports, first mooted in 2007 when we launched the Journal of Medical Case Reports, is now close to launch. What is it?
It’s search, effectively. It uses TEMIS text mining technology and medical ontologies to mine XML from all BioMed Central journals which publish medical case reports. And we are working with other publishers to include their content in the database. Publishers who use the CC-BY license and deposit content in PubMed Central can be included efficiently.
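To make the idea concrete, mining article XML against an ontology and then searching the result can be sketched as below. This is a deliberately simple dictionary lookup, not the TEMIS technology the database uses: the `ONTOLOGY` mapping, the concept labels and the function names are all hypothetical stand-ins for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy stand-in for a medical ontology: surface term -> concept label.
# Synonyms map to the same concept so searches find both wordings.
ONTOLOGY = {
    "stroke": "condition:stroke",
    "cerebrovascular accident": "condition:stroke",
    "aspirin": "drug:aspirin",
    "warfarin": "drug:warfarin",
}


def extract_concepts(article_xml):
    """Return the set of ontology concepts mentioned anywhere in the XML text."""
    text = " ".join(ET.fromstring(article_xml).itertext()).lower()
    found = set()
    for term, concept in ONTOLOGY.items():
        # Whole-word match so "stroke" does not match inside other words.
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            found.add(concept)
    return found


def build_index(articles):
    """Map each concept to the ids of the case reports that mention it."""
    index = defaultdict(set)
    for article_id, article_xml in articles.items():
        for concept in extract_concepts(article_xml):
            index[concept].add(article_id)
    return index


def search(index, *concepts):
    """Return ids of case reports that mention every requested concept."""
    matches = [index.get(concept, set()) for concept in concepts]
    return set.intersection(*matches) if matches else set()


if __name__ == "__main__":
    reports = {
        "a1": "<article><body>A patient on warfarin had a cerebrovascular accident.</body></article>",
        "a2": "<article><body>Aspirin use in a patient with stroke.</body></article>",
    }
    idx = build_index(reports)
    print(sorted(search(idx, "condition:stroke", "drug:warfarin")))
```

Because the input is openly licensed XML, content from other publishers can be added to `reports` in the same way, which is the practical point of the licensing requirement mentioned above.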
It is not a clinical decision support tool, but it can potentially supplement evidence-based resources when clinicians come across a situation where there isn’t an RCT or treatment guideline that matches their patient’s history, co-medications and co-morbidities. It is also useful for teaching, and for regulators and researchers.
It uses text mining technology, and OA and OA licenses make text mining a much more efficient process. So as well as generating “the problem” (in air quotes), OA and the internet can also provide solutions to information overload. By removing barriers to discovery and reuse of the literature we may ultimately improve the pace of research and patient outcomes.
Products such as CasesDB will help us deal with the inevitable information overload in the literature. OA may be driving growth in the literature but when done intelligently (with the right licenses) can also drive innovation and make the literature more useful.