A personal view of the big picture in Research Data Management, given at GFBio - de.NBI Summer School 2018 Riding the Data Life Cycle! Braunschweig Integrated Centre of Systems Biology (BRICS), 03 - 07 September 2018
A Big Picture in
Research Data Management
Carole Goble
The University of Manchester
Head of Node: ELIXIR-UK
Coordinator: FAIRDOM
Chair RDM User Group: University of Manchester
carole.goble@manchester.ac.uk
GFBio - de.NBI Summer School 2018 Riding the Data Life Cycle!
Braunschweig Integrated Centre of Systems Biology (BRICS)
03 - 07 September 2018
Stodden, Seiler, Ma. An empirical analysis of journal policy effectiveness for computational
reproducibility, PNAS March 13, 2018. 115 (11) 2584-2589;
https://doi.org/10.1073/pnas.1708290115
Since 2011
sharing/publishing assets in public archives…
Data Models
*top three most popular
The evolution of standards and data management practices in systems biology
(2015). Stanford et al, Molecular Systems Biology, 11(12):851
NIH Rigor and Reproducibility
https://www.nih.gov/research-training/rigor-reproducibility
Plenty of advice
cos.io/top
Plenty of Funder Data Policies
http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies
Pontika et al, Fostering Open Science to Research using a Taxonomy and an eLearning Portal at
iKnow: 15th International Conference on Knowledge Technologies and Data Driven Business,
http://dx.doi.org/10.1145/2809563.2809571
Open Science Taxonomy
https://wellcomeopenresearch.org/ Nature Scientific Data
Data Publishing and Citation
http://www.scholix.org/
https://datacite.org/
https://www.force11.org/datacitationprinciples
https://www.nature.com/sdata/
“The FAIR Guiding Principles for scientific data management and stewardship”
Scientific Data 3, 160018 (2016) doi:10.1038/sdata.2016.18
Principles
Metadata
Identifiers
Access policies
Standards
Technical: Political
Social
Economic:
A rallying cry
….
Research Data Management
Retain
(or dispose)
Review
(replicate & validate)
Reproduce
(verify, compare)
By the
researcher
and their
collaborators
By their
peers, the
public and
competitors
(include,
combine)
Fifty Shades of FAIR
Workflows
SOPs
Containers, cloud services, common services
Packaging platforms (Research Objects)
Markup languages,
reporting guidelines and
checklists, ontologies,
catalogues
Sounds hard….Catalogues
Search markup
…. RDM Lifecycles
Collection, Sharing
Stewardship Integration
Primary & secondary data,
models, SOPs
Metadata
Experimental context
Integration with
in house data infrastructures
FAIR
Organise & link assets
Standardised, consistent
reporting
Reproducible
publications
Yellow pages
Exchange among colleagues
How and when to share and
publish
Get and give credit
Retain and find beyond project
Span across legacy,
in house, external systems,
community archives
Integrate with tools, analysis platforms,
in house data infrastructures
Curation support
Capacity building
Metadata practices
Policies and governance
Knowing what to throw away
Do Research
Research Infrastructure
Services
Assemble
Methods, Materials
Experiment, Observe, Simulate
Analyse
Results
Quality
Assessment
Track and Credit
Disseminate
Deposit &
Licence
Marketplace
Services
Publish
Share
Results
Any
research
product
Selected
products
Manage
Results
Science 2.0 Repositories: Time for a Change in Scholarly Communication Assante, Candela, Castelli, Manghi, Pagano, D-Lib 2015
Science 2.0
Repositories
101 Innovations in Scholarly Communication - the Changing Research Workflow, Bosman and Kramer, 2015,
http://figshare.com/articles/101_Innovations_in_Scholarly_Communication_the_Changing_Research_Workflow/1286826
A RDM Ecosystem
Team Science …….Of Individuals
Collaborating and Competing Simultaneously
Self-deposit, self-curating, variable stewardship skills
The RDM Team…
A RDM Egosystem
FAIR RDM in the Team
multi-partner, multi-disciplinary projects
What methods are being used to determine
enzyme activity?
What SOP was used for this
sample?
Where is the validation data for this model?
Is there any group generating kinetic data?
Is this data available?
Track versions of my model
What's the relationship between the data and
model?
Which data belong to
which publications?
Project Managed Spaces:
Organisation -> Sharing -> Dissemination
Project
Investigation
Programme
Self-controlled spaces
Managed spaces
One entry point
over external
systems
A Project Commons
X = data, software, method, article
I can access your X
Your X is (re)usable by me and with my tools/data
I get credit for using your X
You can’t use my X
Only access/use my X if I say so
I don’t have resources and skills to make my X
reusable and reproducible
I must get credit if you use X
Someone else will pay for X stewardship and archiving.
X will always be there & free for me.
Maturing this view.
FAIR RDM outside the Team
“Getting it published,
not getting it right”
Matt Spitzer, COS, Jisc-CNI
Leadership Conference 2018
Reuse Debt
Annotate
for
strangers
Organise
Share
Disseminate
Data decreases
Metadata
increases
Reach increases
• Metadata quality and
quantity
• Identifier hygiene
me
ME
my team
close
colleagues
peers
Access Spiral: Staged sharing
organisation – collaboration - dissemination
The number of assets
reduces
Reach of sharing
increases
The richness of metadata
needed increases
Burden of work increases
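One way to read the access spiral is as a rule: each wider audience needs everything the narrower one needed, plus more metadata. The sketch below illustrates that under stated assumptions. The stage names follow the slide, except "public", which is added here. The field names are purely hypothetical and do not come from any standard or from FAIRDOM.

```python
# Stages of the access spiral, ordered from narrowest to widest reach.
# "public" is an assumed final stage added for illustration.
STAGES = ["me", "my team", "close colleagues", "peers", "public"]

# Hypothetical metadata fields that each stage adds on top of the previous ones.
ADDED_FIELDS = {
    "me": ["title"],
    "my team": ["creator", "date"],
    "close colleagues": ["method"],
    "peers": ["units", "licence"],
    "public": ["identifier", "description"],
}


def required_fields(stage):
    """Return every field needed to share at `stage`, accumulating earlier stages."""
    fields = []
    for name in STAGES[: STAGES.index(stage) + 1]:
        fields.extend(ADDED_FIELDS[name])
    return fields


def missing_fields(metadata, stage):
    """List the required fields that are absent or empty in `metadata`."""
    return [field for field in required_fields(stage) if not metadata.get(field)]


asset = {"title": "Enzyme kinetics run 3", "creator": "A. Researcher", "date": "2018-09-03"}
print(missing_fields(asset, "my team"))
print(missing_fields(asset, "peers"))
```

The list of missing fields grows as the requested reach widens, which is the "burden of work increases" point in executable form.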
Data Science
Analytics
Machine learning
Discovery, New algorithms
Data stewardship
Standardisation, Harmonisation,
Annotation and enrichment,
Maintaining access, preserving
Software stewardship
Updates, versions, porting
Prep & Processing
Data wrangling & curation
Instrument pipelines
Simulation sweeps
Personal Productivity
reviewers want additional work
statistician wants more runs
analysis needs to be repeated
post-doc leaves,
student arrives
new/revised datasets
updated/new versions of
algorithms/codes
sample was contaminated
better kit - longer simulations
new partners, new projects
Means educating PIs
and Supervisors
Personal Productivity
Retention, reuse
Publish driven
Public Good
Sharing & Reproducibility
Access driven
Favourite excuses …
The results are embedded
in a figure in the paper
I don’t know where the data is
You can have it but the metadata is so
bad you will need me to interpret it
You can have it but only if you put me
on your paper
Pseudo Sharing
Data Flirting
Data
Hugging
The Reward Norms of Science… more later
You won’t credit me or cite my data but
you’ll demand work from me and use it
for your own research reputation…
Don’t have the
resources or skills
You will ask
me questions
Capitalising on investments
Retaining results post-project
Pooling, transfer, sharing results
Public collections
Skilling workforce
Compliance audit/metrics
Community productivity
Reproducibility
Productivity
Doing science with collaborators
Publishing & getting credit
Access to resources, results, collections
Retention of my results post student
Repeatability - reviewer wants more
Competitiveness, protecting assets
Managing costs
Compliance
Stakeholder, Accountability, Values
overlaps, mismatches?
Stakeholder Agendas
New publishable assets
Business models
Reproducibility
Knowledge Exchange Report: http://www.knowledge-exchange.info/event/ke-approach-open-scholarship
RDM Knowledge Exchange
Public Good
Private Good
Institutional
Facility
Community
Organisation’s
Good
National centres
Publishers, Funders
Policy makers, Government
Public archives
Shared Infrastructure
Shared Data Centres
Global
National
Researcher
Personal
Researchers
Trainers
Students
PIs
Lab books
Group infrastructure
Data managers
Lab managers
Libraries
Institutional
repositories
Publishing in Public Central Repository Repertoire
Stanford et al. The evolution of standards and data management practices in systems
biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
The RDM Ecosystem
• public collections & archives
• data centres
• journals
• Institutional repositories
• most researchers
• labs & universities
• my resources
Services & Activities
Training, Communities, Policy
Data, Tools, Compute, Interoperability
Engage
European
International
National
Industry
domains
technologies, techniques
RDM
select, support, and sustain public
and national data resources
support development of new ones
CDRs
DDs
NDRs
support and advocate for
standards, their adoption and
provide support services
Identifiers.org
run registries, discovery and
analysis tools
coordinate integration efforts
BioTools
support researchers for their data management:
training, DMP, infrastructure, consultancy
by nodes for nodes in their national settings
Nodes
1k+ Databases
1k+ Standards
100+ Policies
https://dsw.fairdata.solutions
Data Stewardship Wizard
Practice identifier hygiene
A unique identifier for each record
800+ data collections
10 Rules for Identifiers
10 Rules for Selecting a BioOntology
200+ Ontologies
https://www.ebi.ac.uk/ols
https://doi.org/10.1371/journal.pbio.2001414
https://doi.org/10.1371/journal.pcbi.100743
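Identifier hygiene can be sketched in a few lines: give every local record its own identifier, and refer to external records by a compact "prefix:accession" identifier. The helper names below are hypothetical. The resolver URL pattern (the compact identifier appended to https://identifiers.org/) is an assumption about how Identifiers.org resolves such identifiers, so check it against the service's documentation.

```python
import uuid


def mint_local_id(existing):
    """Mint a random UUID-based identifier that is not already in `existing`."""
    while True:
        candidate = f"rec-{uuid.uuid4()}"
        if candidate not in existing:
            existing.add(candidate)
            return candidate


def compact_identifier(prefix, accession):
    """Build a 'prefix:accession' compact identifier; reject empty or spaced parts."""
    for part in (prefix, accession):
        if not part or any(ch.isspace() for ch in part):
            raise ValueError(f"bad identifier part: {part!r}")
    return f"{prefix.lower()}:{accession}"


def resolver_url(compact_id):
    """Assumed resolver pattern: the compact identifier appended to identifiers.org."""
    return f"https://identifiers.org/{compact_id}"


seen = set()
records = [{"id": mint_local_id(seen), "sample": name} for name in ["A", "B", "C"]]
assert len({r["id"] for r in records}) == len(records)  # every record has its own id

print(resolver_url(compact_identifier("uniprot", "P12345")))
```

Rejecting whitespace and empty parts at creation time is the cheap half of hygiene. The expensive half is never reusing or silently changing an identifier once it has been shared.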
A trusted virtual environment to store, share & re-use
research information.
Reduce reinvention. Avoid duplication
Simplify access. Support interdisciplinary re-use.
Serve Europe's 1.7 million researchers (of all disciplines) and 70
million science and technology professionals
Open Science
Move, share and re-use data
seamlessly
• across global markets and
borders
• among institutions and
research disciplines
• trusted free flow of data
• data infrastructure to store and
manage data
• high-speed connectivity to
transport data
• High Performance Computers
to process data
Realising the EOSC doi:10.2777/940154
A Research Commons?
collectively created, owned and shared, with governance
“… a cloud-based platform where investigators can store, share, access, and interact
with digital objects (data, software, etc.) generated from …. research.
By connecting the digital objects and making them accessible, the Data Commons is
intended to allow novel scientific research that was not possible before, including
hypothesis generation, discovery, and validation.”
https://commonfund.nih.gov/commons
Pooled Resources
Federation
Access
NIH Data Commons
• Overcoming fragmentation
– Across scattered resources, platforms, people
• Improving flow of information
– Coordination, collaboration
• Cumulative, dynamic
[original figure: Josh Sommer]
Cumulative
A Commons
Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, I3CK, 2013, isbn: 978-3-642-37186-8
http://fora.tv/2010/04/23/Sage_Commons_Josh_Sommer_Chordoma_Foundation
multi-object multi-repositories
Experimental context
All together
Type specific archives
Fragmented silos
Models
Presentations
events
Articles
Workflows
Samples
metadata
Data
Standard Operating Procedures
version,
tracking
provenance
parameters
citation
3 Studies
Model analysis,
construction, validation
24 Assays/Analysis
Simulations,
characterisations
Structured organisation
Retain context in one place
Deposit in the fragmented resources [Penkler, Snoep]
FAIRDOMHub : A Federated “Virtual”
Data Commons based on aggregation
http://fairdomhub.org
External
Databases
In House
Stores
Secure
Stores
Modelling
Resources
Distributed Commons,
Integrated View
Analytical
Resources
In progress
The First and Last Mile
“ramps” onto the Research Data Infrastructures
FAIR data at source – data deposition, validation and upload pipelines into
public repositories
FAIR access from my tools
Bench Benefit
The ‘last mile’ challenge for European research e-infrastructures https://doi.org/10.3897/rio.2.e9933
EOSC
Harvesting
Templates
Automation
Tracking
pipelines
Notebooks
Spreadsheet
wrangling
Data2Paper
Data
Tracking
Sheets
https://ncip.nci.nih.gov/blog/face-new-tragedy-commons-remedy-better-metadata/
“Creating good metadata takes considerable work ….
when investigators act in their own self-interest,
taking short cuts to generate metadata as quickly as
possible, we should expect that the overall utility of
the resource will decline.
… a need for easy-to-use solutions that are generic to provide
guidance over the entire life cycle of metadata — streamlining
metadata creation, discovery, and access, as well as supporting
metadata publication to third-party repositories”
Mark Musen
Stanford
The First Mile: Metadata at Source
Reduce complexity
[Figure: research lifecycle diagram. Research Infrastructure Services support a cycle of Assemble (Methods, Materials); Experiment, Observe, Simulate; Analyse Results; Quality Assessment; Track and Credit; Manage Results; Share Results; Deposit & Licence; Disseminate via Marketplace Services]
Building a FAIR Research Commons
Science 2.0 Repositories: Time for a Change in Scholarly Communication
Assante, Candela, Castelli, Manghi, Pagano DOI: 10.1045/january2015-assante
Mesirov, J. Accessible Reproducible Research, Science
327(5964), 415-416 (2010)
Born FAIR
Elsewhere
on-date
Within
during
[Figure: research lifecycle diagram. Research Infrastructure Services support a cycle of Assemble (Methods, Materials); Experiment, Observe, Simulate; Analyse Results; Quality Assessment; Track and Credit; Manage Results; Share Results; Deposit & Licence; Disseminate via Marketplace Services]
Releasing
Portable
Reproducible
Objects
Supporting researchers to
make & exchange FAIR content
as they go… Credit for all products
Value quality
Data + the Methods
Packaging: data + methods + models
Scharm M, Wendland F, Peters M, Wolfien M, Theile T, Waltemath D. SEMS, University of Rostock
zip-like file with a manifest & metadata
- Bundling files - Keeping provenance
- Exchanging data - Shipping results
Bergmann, F.T., Adams, R., Moodie, S., Cooper, J., Glont, M., Golebiewski, M., ... & Olivier, B. G. (2014). COMBINE archive and OMEX format:
one file to share all information to reproduce a modeling project. BMC Bioinformatics, 15(1), 1.
Combine Archive
https://sems.unirostock.de/projects/combinearchive/
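The "zip-like file with a manifest" idea can be sketched with the Python standard library alone. This is a minimal illustration, not the reference implementation. The manifest namespace and the format URIs are written from memory of the OMEX specification and should be checked against it. `build_archive` and the file names are hypothetical.

```python
import io
import zipfile
from xml.sax.saxutils import quoteattr

# Assumed namespace of the OMEX manifest; verify against the specification.
OMEX_NS = "http://identifiers.org/combine.specifications/omex-manifest"


def build_archive(files):
    """Return the bytes of a zip holding `files` plus a manifest.xml listing them.

    `files` maps a path inside the archive to a (format_uri, content_bytes) pair.
    """
    entries = [
        '  <content location="." format="http://identifiers.org/combine.specifications/omex"/>',
        f'  <content location="./manifest.xml" format="{OMEX_NS}"/>',
    ]
    for path, (fmt, _) in sorted(files.items()):
        # quoteattr escapes the value and wraps it in quotes for safe XML output.
        entries.append(f'  <content location={quoteattr("./" + path)} format={quoteattr(fmt)}/>')
    manifest = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<omexManifest xmlns="{OMEX_NS}">\n' + "\n".join(entries) + "\n</omexManifest>\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.xml", manifest)
        for path, (_, content) in sorted(files.items()):
            archive.writestr(path, content)
    return buffer.getvalue()


omex = build_archive({
    "model.xml": ("http://identifiers.org/combine.specifications/sbml", b"<sbml/>"),
    "data/timecourse.csv": ("http://purl.org/NET/mediatypes/text/csv", b"t,x\n0,1\n"),
})
with zipfile.ZipFile(io.BytesIO(omex)) as archive:
    print(archive.namelist())
```

The manifest is what turns an ordinary zip into something a tool can interpret: each entry declares what kind of file it is, so model, data and metadata travel together.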
The Cinderella of RDM:
Standard Operating Procedures
Record your
processing
steps
Precision medicine NGS pipelines
Alterovitz, Dean, Goble, Crusoe, Soiland-Reyes et al Enabling Precision
Medicine via standard communication of NGS provenance, analysis, and
results, biorxiv.org, 2017, https://doi.org/10.1101/191783
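"Record your processing steps" can start very small. The sketch below wraps each step so that its name, parameters and the checksums of its input and output are logged as it runs. `ProcessingLog` and the two toy steps are hypothetical, and this is not the provenance format proposed in the cited paper.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data):
    """Hex SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


class ProcessingLog:
    """Append-only record of processing steps (a hypothetical structure)."""

    def __init__(self):
        self.steps = []

    def run(self, name, func, data, **params):
        """Apply `func(data, **params)` and record what went in and what came out."""
        result = func(data, **params)
        self.steps.append({
            "step": name,
            "parameters": params,
            "input_sha256": sha256_of(data),
            "output_sha256": sha256_of(result),
            "finished_utc": datetime.now(timezone.utc).isoformat(),
        })
        return result

    def to_json(self):
        return json.dumps(self.steps, indent=2, sort_keys=True)


def drop_header(data, lines=1):
    """Toy step: remove the first `lines` lines."""
    return b"\n".join(data.split(b"\n")[lines:])


def to_upper(data):
    """Toy step: upper-case the content."""
    return data.upper()


log = ProcessingLog()
raw = b"sample,value\na,1\nb,2"
cleaned = log.run("drop_header", drop_header, raw, lines=1)
final = log.run("to_upper", to_upper, cleaned)
print(log.to_json())
```

Because each step's output checksum equals the next step's input checksum, a reader can confirm that the recorded chain is unbroken without rerunning anything.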
Assemble, share, and analyze large and
complex multi-element datasets
distributed across multiple locations,
referenced because too big
Secure large scale moving of patient
data.
Chard et al I'll take that to go: Big data bags and minimal identifiers
for exchange of large, complex datasets,
https://doi.org/10.1109/BigData.2016.7840618
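The "referenced because too big" idea can be illustrated with a BagIt-style layout. Small files are stored in the bag. Large files are only listed, with a URL, a length and a checksum, in `fetch.txt`. The sketch below writes the files by hand rather than using the authors' tooling. `write_bag`, the example URL and the placeholder digest are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def write_bag(bag_dir, local_files, remote_files):
    """Write a minimal BagIt-style layout.

    local_files: {relative_path: bytes} stored under data/.
    remote_files: {relative_path: (url, length, sha256_hex)} listed in fetch.txt only.
    """
    bag_dir = Path(bag_dir)
    bag_dir.mkdir(parents=True, exist_ok=True)
    manifest_lines, fetch_lines = [], []
    for rel, content in sorted(local_files.items()):
        target = bag_dir / "data" / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        manifest_lines.append(f"{hashlib.sha256(content).hexdigest()}  data/{rel}")
    for rel, (url, length, digest) in sorted(remote_files.items()):
        # Too big to ship: record where to fetch it and the checksum it must match.
        fetch_lines.append(f"{url} {length} data/{rel}")
        manifest_lines.append(f"{digest}  data/{rel}")
    (bag_dir / "bagit.txt").write_text("BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    (bag_dir / "manifest-sha256.txt").write_text("\n".join(manifest_lines) + "\n")
    if fetch_lines:
        (bag_dir / "fetch.txt").write_text("\n".join(fetch_lines) + "\n")
    return bag_dir


# Stand-in digest; a real bag would carry the checksum of the remote file itself.
placeholder = hashlib.sha256(b"placeholder").hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    bag = write_bag(
        Path(tmp) / "study-bag",
        {"readme.txt": b"small file shipped in the bag\n"},
        {"reads/sample1.fastq.gz": ("https://example.org/sample1.fastq.gz", 123456789, placeholder)},
    )
    print((bag / "fetch.txt").read_text())
```

The manifest lists remote and local files alike, so the recipient can verify the complete dataset after fetching, even though the bag that was exchanged stayed small.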
FAIR Exchange of Research Goods
Governance
Stewardship
Credit
Tracking
Lifecycles
Fixity…
Arxiv,
my Lab
myExperiment
GitHub,
Web Service myWebSite
bioModels.org,
openModeller
PubMed
Spreadsheet in
figshare
ArrayExpress,
BioSamples,
PRIDE, GBIF,
my Lab,
institutional
repository
Overlaying the
Research Commons
Ecosystem
Tracking, credit mining, comparison, auto-
metadata, blockchain, boundary objects….
1
3
2
A FAIR KnowledgeWeb of Research Objects
Map across metadata
Threaded publications
Navigate, Pivot-Focus, Cite
Self-describing
Releasing Research: “within during”
Analogous to software products & practices rather than articles
An “evolving manuscript” would begin with a pre-
publication, pre-peer review “beta 0.9” version of an
article, followed by the approved published article itself, [
… ] “version 1.0”.
Subsequently, scientists would update this paper with
details of further work as the area of research develops.
Versions 2.0 and 3.0 might allow for the “accretion of
confirmation [and] reputation”.
Ottoline Leyser […] assessment criteria in science revolve
around the individual. “People have stopped thinking
about the scientific enterprise”.
http://www.timeshighereducation.co.uk/news/evolving-manuscripts-the-future-of-scientific-communication/2020200.article
Demands different
ideas of credit and
citation
Living Entry
Published Snapshot Entry
FAIRDOM Commons Releasing….
G. Penkler, F. Du Toit, W. Adams, M. Rautenbach, D. C.
Palm, D. D. Van Niekerk, & J. L. Snoep. (2014).
Glucose metabolism in Plasmodium falciparum
trophozoites. FAIRDOMHub.
http://doi.org/10.15490/seek.1.investigation.56
FAIR Play: Walled Gardens
Open science applies to you but not me … not available = not citable
Jurgen Haanstra
Vrije Universiteit,
Amsterdam
Using FAIRDOM my
own lab colleagues
saw what I was
doing and called to
collaborate!
• Licenses
• Negotiated access
• Embargos
• Permission controls
• Staged sharing
• Private spaces
• enclave sharing
• consortia pressures
• within project mistrusts
• patterns (models vs data)
• hoarding & flirting
• personal dowries
• ex-member divorces
• asymmetrical reciprocity
• credit and citation
• “on date” not “during”
publishing
FAIR Play: RDM Stewardship
Value Systems
• of assets, of reproducibility, of
metadata
• public vs personal good
• economics of infrastructure
• priorities
• stewards and stewardship
• credit & reward
Sweatshops
• competing
• burden - time, skills
• short term, shortcuts
• untrained
• leadership sets the tone
The reward norms of
science
need to change
Everyone knows this.
No-one knows how to fix it.
All research products and all scholarly labour
are equally valued
(except by institutional promotion boards,
funding panels, and review committees)
Data Journals
Data Citation
Data Policies: Open Data by Default
Credit & Citation
Infrastructure
(altmetrics based)
Data Stewardship Careers
Credit – giving and taking
CRediT
Stop conflating credit with authorship
Getting people to cite data
Data Citation Metadata Landing Pages
Persistent
Identifiers
Data citation mining
https://project-thor.eu/
https://casrai.org/credit/ https://www.nature.com/articles/sdata201539
Making Data Count
Linking Data to Literature
https://www.project-freya.eu/
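Getting people to cite data is easier when the citation can be generated from the record's metadata. The sketch below assumes a "Creators (Year). Title. Publisher. Identifier" pattern, in the style of the FAIRDOMHub snapshot citation that appears earlier in these slides. The `DataCitation` class is hypothetical and is not an API of any citation service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCitation:
    """Minimal citation metadata for a published data snapshot."""

    creators: tuple
    year: int
    title: str
    publisher: str
    doi: str

    def landing_url(self):
        """Persistent identifier expressed as a resolvable https://doi.org/ link."""
        return f"https://doi.org/{self.doi}"

    def text(self):
        """Format the citation as 'Creators (Year). Title. Publisher. URL'."""
        names = ", ".join(self.creators)
        title = self.title.rstrip(".")
        return f"{names} ({self.year}). {title}. {self.publisher}. {self.landing_url()}"


snapshot = DataCitation(
    creators=("G. Penkler", "F. Du Toit", "W. Adams", "M. Rautenbach",
              "D. C. Palm", "D. D. Van Niekerk", "J. L. Snoep"),
    year=2014,
    title="Glucose metabolism in Plasmodium falciparum trophozoites",
    publisher="FAIRDOMHub",
    doi="10.15490/seek.1.investigation.56",
)
print(snapshot.text())
```

Keeping the DOI as a bare string and deriving the link from it means the same record can feed a landing page, a reference list and citation mining without retyping.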
Stable & Sustained Infrastructure & Support
FAIR ≠ FREE
Countless expectations to do RDM
Much less in how to sustain the archives, infrastructure
and the skills needed
“we want FAIR data but we will only support research”
Complexity of funding federated commons with project-based national funds
Funding models need an update!
A Bigger RDM Picture
Fragmentation
Federation
Ecosystem
Embed in working practice
Born FAIR Ramps
First & Last
Mile
Egosystem
Stakeholders
Research Objects
Stewardship
Professionalisation
Cultural norms
Interoperability
FAIR is not FREE
Releasing
Credit, reward
What can you do?
Five steps to better data, better research
Get expert help and give
stewards credit
Train your Team
incl. your PI
Publish your Data
and credit others
Develop a DMP
and resource it
Annotate for
strangers
Create analysis-friendly data
Record your processing steps
Use a unique identifier
for each record
Use standards
Save and backup raw data
Submit to a repository.
Get a DOI
Try to use platforms and tools
that work together
Acknowledgements
• David De Roure
• Tim Clark
• Sean Bechhofer
• Robert Stevens
• Christine Borgman
• Victoria Stodden
• Marco Roos
• Jose Enrique Ruiz del Mazo
• Oscar Corcho
• Ian Cottam
• Steve Pettifer
• Magnus Rattray
• Chris Evelo
• Katy Wolstencroft
• Robin Williams
• Pinar Alper
• C. Titus Brown
• Greg Wilson
• Kristian Garza
• Matthew Dovey
• Nick Juty
• Helen Parkinson
• Juliana Freire
• Jill Mesirov
• Simon Cockell
• Paolo Missier
• Paul Watson
• Gerhard Klimeck
• Matthias Obst
• Jun Zhao
• Daniel Garijo
• Yolanda Gil
• James Taylor
• Alex Pico
• Sean Eddy
• Cameron Neylon
• Barend Mons
• Kristina Hettne
• Stian Soiland-Reyes
• Rebecca Lawrence
• Michael Crusoe
• Raphael Jimenez
• Alasdair Gray
Jon Olav Vik,
Norwegian University of Life Sciences
Maksim Zakhartsev
University Hohenheim, Stuttgart,
Germany
Alexey Kolodkin
Siberian Branch
Russian Academy of Sciences
Tomasz Zieliński,
SynthSys Centre
University Edinburgh, UK
Martin Peters, Martin Scharm
Systems Biology Bioinformatics
University of Rostock, Germany
Hadas Leonov