A 45-minute presentation given at 'Getting published in Nature's Scientific Data journal', an event hosted by the University of Cambridge Research Data Management team (www.data.cam.ac.uk). Presented on Monday 11th January 2016.
1. How to share useful data
Peter McQuilton
Biosharing.org
@drosophilic
2. Outline
• Data sharing
• Reusability and reproducibility
• How the lack of these affects scientific accountability and progress
• Experimental context
• What to report – what level of granularity
• How to report it – what format, structure
• Content standards
• How to find them
• Complying with repositories, funders and publishers
6. A community mobilization for “openness”
image by Greg Emmerich
http://discovery.urlibraries.org/ https://okfn.org
Open data is a means to do better science more efficiently
http://pantonprinciples.org
https://creativecommons.org
11. “Reproducing the method took several months of effort, and
required using new versions and new software that posed
challenges to reconstructing and validating the results”
Unfairness in both experimental and computational areas
12. • Not always well cited, stored
o Software, codes, workflows are hard(er) to get hold of
• Poorly described for third party reuse
o Different level of detail and annotation
• Curation activities are perceived as time consuming
o Collection and harmonization of detailed methods and
experimental steps is rushed at the publication stage
Not very FAIR: low findability and
understandability
13. What can I do to ensure my data are shareable/usable in the future?
• Effectively document your data so that it can be understood in the future
• Periodically move data to new storage media (drives degrade over time)
• Keep more than one copy of data (local and cloud)
• Migrate data to new software versions
• Use a well documented and supported format
Ideally this should be covered in a data management plan at the start of a project, so that you can factor any associated time and resources into your budget.
14. Outline
• Data sharing
• Reusability and reproducibility
• How the lack of these affects scientific accountability and progress
• Experimental context - standards
• What to report – what level of granularity
• How to report it – what format, structure
• Content standards
• How to find them
• Complying with repositories, funders and publishers
15. Do you know what this is?
LS1_C2_LD_TP2_P1 file1-fastq.gz
17. …how NOT to report the experimental
information!
Sample name (?!) Data file
LS1_C2_LD_TP2_P1 file1-fastq.gz
18. We need to clearly describe the information
• LS1 liver sample 1
• C2 compound 2
• LD low dose
• TP2 time point 2
• P1 protocol 1
• file1-fastq.gz compressed data file for sequence
information corresponding to this
sample
Sample name (?!) Data file
LS1_C2_LD_TP2_P1 file1-fastq.gz
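One way to make such a sample name self-describing is to expand each token into a labelled field. The sketch below is illustrative only: the two lookup tables simply restate the legend on this slide, and the function name `decode_sample_name` and the field labels are assumptions, not part of any standard or tool.

```python
import re

# Hypothetical codebook restating the slide's legend.
# Numbered tokens: prefix -> (field label, meaning); the trailing digits are kept.
NUMBERED = {
    "LS": ("sample", "liver sample"),
    "C": ("compound", "compound"),
    "TP": ("time_point", "time point"),
    "P": ("protocol", "protocol"),
}
# Fixed tokens that carry no number.
CODED = {"LD": ("dose", "low dose")}


def decode_sample_name(name: str) -> dict:
    """Expand an underscore-separated sample name into labelled fields."""
    decoded = {}
    for token in name.split("_"):
        if token in CODED:
            label, meaning = CODED[token]
            decoded[label] = meaning
            continue
        match = re.fullmatch(r"([A-Z]+)(\d+)", token)
        if match is None or match.group(1) not in NUMBERED:
            raise ValueError(f"Unrecognised token: {token!r}")
        label, meaning = NUMBERED[match.group(1)]
        decoded[label] = f"{meaning} {match.group(2)}"
    return decoded


if __name__ == "__main__":
    for label, value in decode_sample_name("LS1_C2_LD_TP2_P1").items():
        print(f"{label}: {value}")
```

Reporting the expanded fields alongside the data file, rather than only the packed name, spares a third party from having to guess the codebook.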
23. Information intensive experiments
• We need to report sufficient information to reuse the dataset
• We must strike a balance between depth and breadth of information
25. Seven week old C57BL/6N mice were treated
with low-fat diet.
Liver was dissected out, hepatocytes prepared…
From natural language to ‘computable’ concepts
31. Age value
Unit
Strain name
Subject of the experiment
Type of diet and
experimental condition
Anatomy part
Seven week old C57BL/6N mice were treated
with low-fat diet.
Liver was dissected out, hepatocytes prepared …
From natural language to ‘computable’ concepts
Type of protocol – cell preparation
Type of protocol - sample treatment
Type of protocol – liver preparation
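As a rough illustration of what 'computable' could mean here, the sentence on this slide can be written as a structured record in which every labelled concept is a separate field. This is a minimal sketch: the key names and the `flatten` helper are assumptions chosen for the example, not a published schema, and a real record would normally point to ontology terms rather than plain strings.

```python
import json

# Hypothetical structured version of:
# "Seven week old C57BL/6N mice were treated with low-fat diet.
#  Liver was dissected out, hepatocytes prepared ..."
sample_description = {
    "subject": {
        "organism": "mouse",
        "strain": "C57BL/6N",
        "age": {"value": 7, "unit": "week"},
    },
    "treatment": {"diet": "low-fat diet"},
    "material": {"anatomy_part": "liver", "cell_type": "hepatocyte"},
    "protocols": ["sample treatment", "liver preparation", "cell preparation"],
}


def flatten(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dotted keys, e.g. 'subject.age.value'."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


if __name__ == "__main__":
    print(json.dumps(sample_description, indent=2))
    for path, value in flatten(sample_description).items():
        print(f"{path} = {value}")
```

Once age is stored as a number with a unit, a query such as "all samples from animals younger than 10 weeks" becomes a simple comparison instead of a text search.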
32. How do you know what to report, or how to
structure it?
• Data/content standards:
• Structure, enrich and report the description of the
datasets and the experimental context under which they
were produced
• Facilitate the discovery, sharing, understanding and
reuse of datasets
33. Outline
• Data sharing
• Reusability and reproducibility
• How the lack of these affects scientific accountability and progress
• Experimental context
• What to report – what level of granularity
• How to report it – what format, structure
• Content standards
• How to find them
• Complying with repositories, funders and publishers
40. Enablers: to better describe, share and query data
• Minimum information reporting requirements, or checklists
o Report the same core, essential information
• Controlled vocabularies, taxonomies, thesauri, ontologies etc.
o Use the same word and refer to the same ‘thing’
• Conceptual model, conceptual schema, or exchange formats
o Allow data to flow from one system to another
41. A web-based, curated and searchable registry ensuring that biological
standards and databases are registered, informative and discoverable; also
monitoring the development and evolution of standards, their use in databases
and the adoption of both in data policies.
42. Researchers, developers and curators lack support and guidance on how best to navigate and select content standards, understand their maturity, or find databases that implement them.
Funders, journals and librarians do not have enough information to make informed decisions on which content standards or databases to recommend in policies, or to fund or implement.
Our mission: To help people make the right choice
47. The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta
Sansone www.ebi.ac.uk/net-project
Search and filter to find what is relevant to your type of data
48. Tracking evolution, e.g. deprecations and substitutions
53. User profiles populated from ORCID...
54. ... credit for creating, contributing to, maintaining standards, databases and policies
Ownership of open standards can be problematic in broad, grass-roots collaborations.
It requires improved models to encourage maintenance of, and contributions to, these efforts. Rewards and incentives need to be identified for all contributors, to support the continued development of standards.
55. What you can do with BioSharing…
“Which standard should I use for this data, considering I’d like to publish in journal X?”
“Are we using the most up-to-date version of this standard?”
“My data is in X format, which databases take that format?”
56. How can you use community standards?
ISA model and related formats
These tools and formats will help you to:
57. ISA powers data collection, curation resources and repositories, e.g.:
ISA model and related formats
59. 1. Create template(s) to fit the type of experiments to be described
Create templates detailing the steps to be reported for different investigations, complying with community standards, e.g. by configuring the value(s) allowed for each field to be
• text (with/without regular expressions),
• ontology terms,
• numbers etc.
We have ‘ready to use’ community standards compliant configurations and can create more according to user needs
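The idea of a template that constrains each field can be sketched in a few lines. This is not the ISA tools configuration format: the `TEMPLATE` structure, the field names, the rule types and the short list standing in for ontology terms are all hypothetical, chosen only to show the three kinds of constraint listed above.

```python
import re

# Hypothetical template: field name -> rule describing the allowed values.
TEMPLATE = {
    # Free text constrained by a regular expression.
    "sample_name": {"type": "text", "pattern": r"[A-Za-z0-9_]+"},
    # Stand-in for an ontology lookup: a fixed set of permitted terms.
    "organism_part": {"type": "ontology", "allowed": {"liver", "kidney"}},
    # Numeric value.
    "age_weeks": {"type": "number"},
}


def validate(record: dict, template: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, rule in template.items():
        if field not in record:
            errors.append(f"{field}: missing")
            continue
        value = record[field]
        if rule["type"] == "text":
            pattern = rule.get("pattern")
            if not isinstance(value, str):
                errors.append(f"{field}: expected text")
            elif pattern and not re.fullmatch(pattern, value):
                errors.append(f"{field}: does not match {pattern}")
        elif rule["type"] == "ontology":
            if value not in rule["allowed"]:
                errors.append(f"{field}: {value!r} is not a permitted term")
        elif rule["type"] == "number":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{field}: expected a number")
    return errors


if __name__ == "__main__":
    good = {"sample_name": "LS1_C2_LD_TP2_P1", "organism_part": "liver", "age_weeks": 7}
    bad = {"sample_name": "bad name!", "organism_part": "tail", "age_weeks": "seven"}
    print(validate(good, TEMPLATE))
    print(validate(bad, TEMPLATE))
```

Checking records against a template at data-entry time catches inconsistencies early, instead of leaving them to be reconciled in a rush at the publication stage.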
60. • The ISA model records the data’s provenance, how it was generated and
where it is located.
• Published Data Descriptors are indexed in all major bibliographic indexing
services (incl. PubMed)
• However, accompanying every Data Descriptor article there are metadata files,
specifically created to aid discovery and understanding of the data itself.
• Using the ISA (Investigation, Study, Assay) model, these metadata files
provide a machine readable overview of the study that generated the data.
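The nesting implied by the name (an Investigation contains Studies, which contain Assays) can be pictured with a small data structure. The classes below are a simplified sketch for illustration only; they are not the ISA tools API or the ISA-tab file layout, and the attribute names and example values are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Assay:
    """One measurement performed in a study, with the files it produced."""
    measurement_type: str
    data_files: List[str] = field(default_factory=list)


@dataclass
class Study:
    """A unit of research with its own subjects and design."""
    title: str
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """The overall project context, grouping related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def data_files(self) -> List[str]:
        """Collect every data file reachable from this investigation."""
        return [
            name
            for study in self.studies
            for assay in study.assays
            for name in assay.data_files
        ]


# Hypothetical example values; the file name is the one used earlier in the talk.
investigation = Investigation(
    title="Example investigation",
    studies=[
        Study(
            title="Example study",
            assays=[Assay("sequencing", ["file1-fastq.gz"])],
        )
    ],
)

if __name__ == "__main__":
    print(investigation.data_files())
```

Because each data file hangs off an assay, which hangs off a study, a reader (or a program) can trace any file back to the context in which it was generated.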
61. ISA-explorer: A demo tool for discovering and exploring Scientific Data’s ISA-tab metadata
• Filter datasets by data repository or metadata
• Boolean searches
• Future enhancements:
- Statistics
- Richer queries based on semantics of the data
62. ISA-explorer: A demo tool for discovering and exploring Scientific
Data’s ISA-tab metadata
Visualise the data
associated with
a paper
http://tinyurl.com/isaexplorer
63. Summary
• Reusability and reproducibility
o Are pivotal to driving science and discovery
o Do your best to make your digital research outputs FAIR
• Experimental context
o Report the experimental context of your findings
o Do to your data what you wish others would do to theirs
• Content standards
o Are continuously evolving
o Make use of tools implementing standards, such as ISAtools
o Use biosharing.org to explore repositories, standards and policies
65. Useful links
Find the right database for your data, and which data standard to use – https://www.biosharing.org
Check that your data conform to a standard, or make your own templates – http://www.isa-tools.org
Where to keep research data: checklist for evaluating data repositories (DCC) – http://tinyurl.com/DCCResearchData
How and why you should manage your research data (JISC) – http://tinyurl.com/JISCDMP