Laurie Goodman on "Overcoming Hurdles to Data Publication" for the Alan Turing Institute Symposium on Reproducibility for Data-Intensive Research, Oxford, 7th April 2016.
1. Overcoming Hurdles to Data Publication
Laurie Goodman, PhD
Editor-in-Chief GigaScience
ORCID ID: 0000-0001-9724-5976
@GigaScience
(Personal Twitter acct: @Grimhawk1, but this is mostly me whining about Donald Trump, pit bull discrimination, and why I hate TSA and Homeland Security)
2. Why should we “publish” data?
1. Ioannidis et al., (2009). Repeatability of published microarray gene expression analyses. Nature Genetics 41: 14
2. Ioannidis JPA (2005) Why Most Published Research Findings Are False. PLoS Med 2(8)
Out of 18 microarray papers, results from 10 could not be reproduced.
3. Deconstructing a paper into accessible, useable, trackable, interlinked units
Need to provide credit to reward sharing and proper organization of:
• Narrative
• Data/Metadata availability/curation
• Source Code, Software availability
• Interoperability
• Availability of workflows
• Transparent analyses
[Figure: a paper deconstructed into Data/Metadata, Source Code/Software, Methods, and Narrative]
4. How we view publishing at GigaScience
• Paper in GigaScience: Open-access journal
• Data Sets in GigaDB: Data Publishing Platform (under CC0 waiver)
• Analyses in GigaGalaxy: Data Analysis Platform
DOIs from
5. GigaScience Publishes (or links to) All Research Objects
Article (Narrative) + Data + Software + Source Code +
Methods + Workflows + Containers/Docker + VMs
[Figure: GigaScience paper + data sets in GigaDB + analyses in GigaGalaxy, labelled with a Data DOI and a Workflow DOI]
6. What is Data Publication?
1. Publishing a standard article that describes the data.
2. Making the data itself citable.
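To make "citable" concrete, here is a minimal sketch of turning a data set's metadata into a reference-list entry. It assumes the commonly recommended "Creator (Year): Title. Publisher. Identifier" pattern for data citations. The `Dataset` class, the `format_data_citation` function, and every metadata value below (including the DOI) are hypothetical placeholders, not a real GigaDB record or API.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Minimal metadata needed to cite a data set (hypothetical structure)."""
    creators: list[str]
    year: int
    title: str
    publisher: str
    doi: str  # bare DOI, without any resolver prefix


def format_data_citation(ds: Dataset) -> str:
    """Build a citation as 'Creator (Year): Title. Publisher. Identifier'."""
    names = "; ".join(ds.creators)
    return f"{names} ({ds.year}): {ds.title}. {ds.publisher}. https://doi.org/{ds.doi}"


# Placeholder record: none of these values refer to a real data set.
example = Dataset(
    creators=["Example, A", "Sample, B"],
    year=2016,
    title="Placeholder genome assembly",
    publisher="Example Repository",
    doi="10.1234/example.5678",
)
citation = format_data_citation(example)
print(citation)
```

The point of the sketch is that the data set carries its own creators, year, and persistent identifier, so it can sit in a reference list alongside articles.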
7. Make it easy to cite
See where it got cited!
Describe
the data
9. ?
Data Publication Hurdles
If only it were easy…
• Data isn’t “scholarly” enough to be a citable entity (a ‘real’ paper)
• If I publish my data, I may not be able to publish the analysis paper later because journals will consider it Prior Publication
• If I publish my data, #DataParasites will use it!!*
*http://www.nejm.org/doi/full/10.1056/NEJMe1516564
Response from Functional Genomics Data Society:
http://fged.org/projects/data-sharing-and-research-parasites/
12. BUT #dataparasites!
Polar bear data were used before the data producer’s analysis paper was published, but the data garnered 5 citations:
Hailer, F et al., Nuclear genomic sequences reveal that polar bears are an old and distinct
bear lineage. Science. 2012 Apr 20;336(6079):344-7. doi:10.1126/science.1216424.
Cahill, JA et al., Genomic evidence for island population conversion resolves conflicting
theories of polar bear evolution. PLoS Genet. 2013;9(3):e1003345.
doi:10.1371/journal.pgen.1003345.
Morgan, CC et al., Heterogeneous models place the root of the placental mammal
phylogeny. Mol Biol Evol. 2013 Sep;30(9):2145-56. doi:10.1093/molbev/mst117.
Cronin, MA et al., Molecular Phylogeny and SNP Variation of Polar Bears (Ursus
maritimus), Brown Bears (U. arctos), and Black Bears (U. americanus) Derived from
Genome Sequences. J Hered. 2014; 105(3):312-23. doi:10.1093/jhered/est133.
Bidon, T et al., Brown and Polar Bear Y Chromosomes Reveal Extensive Male-Biased Gene
Flow within Brother Lineages. Mol Biol Evol. 2014 Apr 4. doi:10.1093/molbev/msu109
http://blogs.biomedcentral.com/gigablog/2014/05/14/the-latest-weapon-in-publishing-data-the-polar-bear/
13. However, this paper didn’t include the data citation…
The Data Publication has since garnered 6 more citations
Even though the data had been released 2 years earlier and been cited in other papers, the main analysis paper was published in Cell (and made the cover).
15. How are Data Citations Doing Overall?
The Location of the Citation: Are Data Citation Recommendations Having an Effect?
Elizabeth Hull, DataCite Blog
https://blog.datacite.org/location-of-the-citation/
Looked at 1,125 journal articles with associated data in Dryad from 2011-2014
[Figure: Proportions of Citation Types Per Year]
Highlights:
• Dryad DOI in the works cited, as recommended = only 6% of total articles
• Dryad DOI in the body only (including data availability sections) = 75%
• No citation (Dryad DOI not found anywhere in the article) = 20%
Good News:
• Data citations in the works cited increased from 5% to 8% from 2011-2014
• Articles with no data citation declined from 31% to 15%
Bad News: At the current growth rate, expect to see 90% in the works cited section in 2031
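How far off 90% is depends heavily on the growth model assumed. The sketch below extrapolates from only the two figures quoted above (5% in 2011, 8% in 2014) under two simple assumptions, straight-line growth and constant compound growth. It does not reproduce the blog post's own projection, whose method is not given in these slides; the function names are illustrative.

```python
import math


def linear_year(y0, p0, y1, p1, target):
    """Year the target share is reached if the share grows by a fixed
    number of percentage points per year."""
    slope = (p1 - p0) / (y1 - y0)
    return y1 + (target - p1) / slope


def compound_year(y0, p0, y1, p1, target):
    """Year the target share is reached if the share grows by a fixed
    multiplicative factor per year."""
    annual_factor = (p1 / p0) ** (1 / (y1 - y0))
    return y1 + math.log(target / p1) / math.log(annual_factor)


# Figures from the slide: 5% of articles in 2011, 8% in 2014, target 90%.
linear = linear_year(2011, 5.0, 2014, 8.0, 90.0)
compound = compound_year(2011, 5.0, 2014, 8.0, 90.0)
print(f"linear: {linear:.0f}, compound: {compound:.0f}")
```

With only these two data points, compound growth reaches 90% around the end of the 2020s, while straight-line growth would take until the end of the century. Either way, the wait is long without active intervention.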
16. More Education Needed
“Easiest” Way Forward is to Engage the Journal Community
• Organizations providing citation guidelines should engage “Editor Evangelists”
• Editor Evangelists will do the following:
o Get Data Citation Guidelines in the Guide to Authors
o Get Data Citation Guidelines in the Copy Editor Handbook
o Tell all their Editor friends and get a cult following
Example: The Standardization of Gene Nomenclature in articles
• The Human Genome Organisation (HUGO) worked with journal editors in the late 1990s to drive use of appropriate gene nomenclature, getting it into the guide to authors.
• Within about 3 years, standard nomenclature was used by all.
Oh, and don’t forget to have the Editors tell the Production Department that DOIs shouldn’t be stripped out and replaced with URLs.
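One reason to keep the DOI itself: it can be found and matched by machine however the reference is typeset. The sketch below pulls DOIs out of a reference string and builds resolver links from them. The regular expression is a common heuristic rather than a formal DOI grammar, and stripping trailing punctuation (including a closing parenthesis) is a simplification that could truncate some unusual DOIs. The function names are illustrative.

```python
import re

# Heuristic: "10.", a 4-9 digit registrant code, "/", then a suffix
# running up to whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_dois(text):
    """Return the distinct DOIs found in text, lower-cased, in order."""
    dois = []
    for match in DOI_PATTERN.findall(text):
        doi = match.rstrip(".,;)").lower()  # drop sentence punctuation
        if doi not in dois:
            dois.append(doi)
    return dois


def doi_link(doi):
    """Build a resolver link while keeping the DOI itself visible."""
    return f"https://doi.org/{doi}"


reference = (
    "Hailer, F et al. Science. 2012 Apr 20;336(6079):344-7. "
    "doi:10.1126/science.1216424."
)
found = extract_dois(reference)
print(found, [doi_link(d) for d in found])
```

If production replaces the DOI with a publisher URL, a scan like this finds nothing and the citation cannot be counted.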
17. Thanks to:
Scott Edmunds, Executive Editor
Nicole Nogoy, Commissioning Editor
Peter Li, Lead Data Manager
Chris Hunter, Lead BioCurator
Xiao (Jesse) Si Zhe, Database Developer
Sam Rose, Journal Development Manager
Rob Davidson, Open Data Lead, Office for National Statistics
Contact us:
editorial@gigasciencejournal.com
database@gigasciencejournal.com
Follow us:
@GigaScience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog
http://gigascience.biomedcentral.com
www.gigadb.org