Plenary Lecture Presented at INCF Neuroinformatics 2019 https://www.neuroinformatics2019.org
Title: FAIRy stories: tales from building the FAIR Research Commons
Findable Accessible Interoperable Reusable. The “FAIR Principles” for research data, software, computational workflows, scripts, or any kind of Research Object is a mantra; a method; a meme; a myth; a mystery. For the past 15 years I have been working on FAIR in a range of projects and initiatives in the Life Sciences as we try to build the FAIR Research Commons. Some are top-down like the European Research Infrastructures ELIXIR, ISBE and IBISBA, and the NIH Data Commons. Some are bottom-up, supporting FAIR for investigator-led projects (FAIRDOM), biodiversity analytics (BioVel), and FAIR drug discovery (Open PHACTS, FAIRplus). Some have become movements, like Bioschemas, the Common Workflow Language and Research Objects. Others focus on cross-cutting approaches in reproducibility, computational workflows, metadata representation and scholarly sharing & publication. In this talk I will relate a series of FAIRy tales. Some of them are Grimm. There are villains and heroes. Some have happy endings; all have morals.
1. FAIRy stories
Tales from building the
FAIR Research Commons
Carole Goble
The University of Manchester, UK
ELIXIR UK Head of Node
FAIRDOM Coordinator
Software Sustainability Institute UK
carole.goble@manchester.ac.uk
INCF Neuroinformatics 2019, Warsaw, September 1-2, 2019
2. FAIR Guiding Principles for Scientific Data Management and
Stewardship, Scientific Data 3, 160018 (2016)
doi:10.1038/sdata.2016.18
3. A Digital Object Research Commons
organising DOs for a field and across fields.
A “shared space” where investigators can store, share, access, connect
and interact with digital objects generated from research, and use
them. Not a Database or Data warehouse.
repositories
zoo
registries
zoo
https://medium.com/@rgrossman1/a-proposed-end-to-end-principle-for-data-commons-5872f2fa8a47 [Bob Grossman, 2018]
4. A Digital Object Research Commons
organising DOs for a field and across fields.
A “shared space” where investigators can store, share, access, connect
and interact with digital objects generated from research, and do
more data-intensive research. Not a Database or Data warehouse.
Ecosystem of pooled
community resources
Federation with many entry
points
Collectively created, owned or
shared by community
Mixed degrees of control
6. We are all trying to build
A FAIR Research Commons
7. We are all trying to build
A FAIR Research Commons
8. An “ad hoc” commons “in the Wild”
Using FAIR as a general principle
Fragmented
ecosystem of pooled
community resources
Distributed
federation with many
entry points and
many providers
Each has its APIs,
Web interfaces, Data
Submission, Tool
deployment
23 countries
15 communities
Including health
Held together with standards, metadata
mark-up, common identifiers, registries,
workflows, shared vision, hard work, love
and hope.
National
datasets
Community,
Public datasets
http://elixir-europe.org
9. Uber FAIR Life Science Commons
Federation over an ecosystem of different fields
Ecosystem of
FAIR innovative
tools
Publish
FAIR life
science data
A zoo of Catalogues of
tools, data, workflows,
computing resources …
10. Our first FAIRy tale:
Finding* stuff in a pre-existing ecosystem
EOSC Dataset Minimum Information
https://eosc-edmi.github.io/
Minimum information
metadata guideline to find and
access datasets reusing existing
data models and interfaces.
Conventions for using
schema.org
Find, Access and Index
Google Dataset Search
Small, Lightweight, Viral
A little bit of Semantics everywhere
*and a bit of provenance, licensing
11. Our first FAIRy tale:
Finding* stuff in a pre-existing ecosystem
Structured data descriptors in web
pages
Low barrier universal mark-up
Harvesting, indexing, search
Exchange & register without API
Automated curation
A little bit of Semantics everywhere
*and a bit of provenance, licensing
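The structured-descriptor idea can be sketched in a few lines. The example below is a minimal illustration, not the Bioschemas profile itself: it builds a schema.org Dataset description as JSON-LD and wraps it in the script element a crawler would look for. The dataset name, URLs and the choice of properties are placeholders assumed for illustration.

```python
import json


def dataset_markup(name, description, url, identifier, license_url, keywords):
    """Return a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "license": license_url,
        "keywords": keywords,
    }


def embed_in_page(markup):
    """Wrap JSON-LD in the script element used to embed it in an HTML page."""
    body = json.dumps(markup, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


# Hypothetical record: every value here is a placeholder.
example = dataset_markup(
    name="Example marine reference genomes",
    description="A placeholder dataset description.",
    url="https://example.org/datasets/1",
    identifier="https://example.org/datasets/1",
    license_url="https://creativecommons.org/licenses/by/4.0/",
    keywords=["genomics", "marine"],
)
snippet = embed_in_page(example)
print(snippet)
```

The point of the sketch is the low barrier: a provider with a web page and no API can still publish a description that harvesters can index.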
12. Our first FAIRy tale:
Finding* stuff in a pre-existing ecosystem
A little bit of Semantics everywhere
The Goldilocks Principle
14. Data Exchange: Without an API
MarRef → BioSamples
https://github.com/EBIBioSamples/bioschemas_marref_demo/blob/master/Summary.md
Bioschemas markup added
to MarRef pages
Markup crawled using BuzzBang
Data included as a BioSample Curation
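The crawl step above can be sketched with the Python standard library alone. This is an illustrative harvester, not how BuzzBang is implemented (that is not described here): it pulls every JSON-LD block out of an HTML page so the records can be handed to a downstream curation step. The sample page and its contents are made up.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.records.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed markup rather than fail the whole crawl


def harvest(html):
    """Return all JSON-LD records embedded in an HTML document."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


# Hypothetical page standing in for a marked-up resource page.
PAGE = (
    "<html><head>"
    '<script type="application/ld+json">'
    '{"@context": "https://schema.org", "@type": "Dataset", "name": "Example record"}'
    "</script></head><body><p>Example</p></body></html>"
)
records = harvest(PAGE)
print(records)
```

No API contract is needed between the two sides: the only shared agreement is the markup vocabulary.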
15. A happy ending approaches
Endorsed by ELIXIR
First types -> Schema.org
Goldilocks
• Esp. good for small data providers
• Types & Profiles debates/explosion
• Domain ontology reuse challenges
• Elegance vs best for tools
• Trolls
Community based demonstration (Toxicology, Rare Disease)
Validation, mark-up & harvesting tools
A subset
of the
FAIR
Principles
16. Is your resource FAIR?
Is your data/workflow/model FAIR from first to last?
The FAIR Data Principles
vs
FAIR the Nice Intention
17. 2014 - Lorentz workshop
2015 - BioHackathon
2016 - Published
Grassroots activity that has
become a top down one.
Many efforts before….
Scientific Data 3, 160018
(2016)
doi:10.1038/sdata.2016.18
2nd Story: FAIR.
Once Upon A Time…
19. https://www.incf.org/activities/standards-and-best-practices/what-is-fair
Machine and human readable
data formats and metadata
that are compliant with many
community standards, that
persist, and tell you the
provenance of the data and
how it's cross-linked
Data and metadata are
locatable and accessible by
GUIDs, standard access
protocols and have the least
restrictive licenses
23. Simple words are powerful
things that can be mangled.
Simple concepts are not so
simple to implement.
One size does not fit all.
Beware FAIR zealots and
vested interests.
24. We { are | will be | always have been } FAIR
Use our platform /technology to be FAIR.
Even if it's not what FAIR meant
Only we control FAIR.
Our way is the right way.
We don’t know what it means to implement
FAIR but we want to measure and certify it.
25. “FAIR principles: interpretations and implementation considerations” J Data
Intelligence, coming soon in 2019…. which was still contentious
26. FAIRy tale -> Reality!
Principles are…
• An aspiration, a journey.
• A call for machine actionability of
data and metadata.
• Ambiguous.
• A spectrum.
• Domain respectful.
• Implementable with today's
protocols and standards.
• A subset of indicators:
– ROI cost/benefit, impact, community
need, sustainability of repository,
quality of content/service….
• Work in progress.
Principles are not…
• A standard.
• Just about humans.
• Strict.
• Technology specific.
• Only for one domain.
• About inventing new
protocols.
• One size fits all.
• Anything to do with quality.
• Synonymous with open.
• Tablets of stone.
• Mons et al Cloudy, increasingly FAIR; Revisiting the FAIR Data guiding principles for the European Open Science Cloud. Information Services &
Use. 37. 1-8. 10.3233/ISU-170824.
• Dunning et al Are the FAIR Data Principles fair? IDCC17
27. FAIRy Stories about FAIR
• It's not about Open
• It's not about a resource’s
Quality or Impact
• It's not actually about
Harmonising all metadata to
one schema.
29. FAIR is a Journey….
Concepts for FAIR Implementation
FAIR Culture
FAIR Ecosystem
Skills for FAIR
Incentives and Metrics
Investment in FAIR
Turning FAIR into Reality, EC Report, 2018
30. Review Criteria for Endorsement of Standards and
Best Practices, 2018 DOI: 10.5281/zenodo.2535741
Subset of principles
applied to standards
and best practices
The INCF Commons
and its Resources
themselves?
“INCF supports the FAIR (Findable, Accessible, Interoperable,
Reusable) principles, and adherence to them is a requirement for
an INCF-endorsed standard or best practice.”
https://www.incf.org/activities/standards-and-best-practices
31. Defining and Implementing FAIR
Clarity
Metrics / Indicators
Maturity Models
Manual / Automated Assessment
FAIRification Methodologies
• At the first mile
• At the last mile
• For the legacy
Toolkits, Tools and Services
35. Matrix of indicators
Maturity levels for
each
+
*The Metric Tide, https://responsiblemetrics.org/the-metric-tide/
A FAIR Assessment
Transparent
evaluation
What, Who, How
Objective evaluations
Narrative feedback on fails
Indicators
Robustness,
Humility,
Transparency,
Diversity,
Reflexivity*
Context
Community standards
Incremental
Cost/benefit
Not just a score
Non judgmental
Scope for novelty
Transparent evaluation Eat the Dog Food
Design-Build-Test-Learn
indicators and evaluation
36. Maturity Model
Value Based Assessment
Selection
Goal Setting
Process planning
Modelling
Transformation
Publishing
[Susheel Varma]
A FAIR Assessment
37. Capability Maturity Model
of entities & their capabilities
Indicators and metrics
measuring levels
Foundational
Components
FAIRification
Process
Awareness and Policy
Standards and Guidelines
People
Infrastructure
Value Based
Assessment
Selection
Goal Setting
Process planning
Modelling
Transformation
Publishing
Impl. Outcome:
Dataset
Persistent Identification
Data Set Discovery
Machine Readability
Data Access and Usage
Preservation and Sustainability
FAIR Data Maturity Model WG
A FAIR Assessment
[Oya Deniz Beyan, 2019]
38. Next meeting 12th September 2019
Sessions at Helsinki RDA Plenary October 2019
39. Licence
Mandatory:
• Metadata includes information about the licence under which the data can be reused
Recommended:
• Metadata includes licence information in the appropriate element of the metadata standard used
• Metadata refers to a standard reuse licence
Optional:
• Metadata includes information about consent for reuse (e.g. personal data)
• Metadata refers to a machine-understandable reuse licence
FAIR Data Maturity Model WG
An “easy” indicator….
“R1.1. (meta)data are released with a clear and
accessible data usage license”
[Table residue: licence format (non-standard, standard, open standard & machine readable) versus what each allows (human readable access, reuse, clear reuse criteria)]
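Even this "easy" indicator needs decisions before it can be tested. The sketch below checks three of the licence statements from the slide against a metadata record. It rests on assumptions that are mine, not the working group's: the licence sits under a `license` key, a "standard" licence is one found in a small illustrative list, and "machine-understandable" is approximated as "given as an http(s) URL".

```python
from urllib.parse import urlparse

# Assumed, illustrative list of "standard" licence URLs; a real assessment
# would rely on a community-agreed registry instead.
STANDARD_LICENCES = {
    "https://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/publicdomain/zero/1.0/",
}


def assess_licence(metadata):
    """Report which licence statements a metadata record satisfies."""
    licence = metadata.get("license")
    is_url = False
    if isinstance(licence, str):
        parts = urlparse(licence)
        is_url = parts.scheme in ("http", "https") and bool(parts.netloc)
    return {
        "includes licence information": bool(licence),
        "refers to a standard reuse licence": is_url and licence in STANDARD_LICENCES,
        "refers to a machine-understandable licence": is_url,
    }


free_text = assess_licence({"license": "Free for academic use"})
cc_by = assess_licence({"license": "https://creativecommons.org/licenses/by/4.0/"})
nothing = assess_licence({"name": "Unlicensed example"})
print(free_text)
print(cc_by)
print(nothing)
```

The free-text record passes only the first statement, which is the gap between a licence a human can read and one a machine can act on.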
40. A trickier indicator…
“R1.3. (meta)data meet domain-relevant
community standards”
FAIR Data Maturity Model WG
Mandatory
Recommended
Metadata complies with a community standard
Data complies with a community standard
Metadata is expressed in compliance with a machine-understandable community standard
Data is expressed in compliance with a machine-understandable community standard
Neuroshapes
Metadata
Portal
Reviewers
Suppose there isn’t a standard or it's not up to it?
Indicators have to be community specific
Librarian’s view point vs Genomics view point?
How is it validated? JSON and SHACL validators.
How is it captured? Spreadsheets.
Interoperability is nearly always purpose specific
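Why indicators have to be community specific can be shown with a toy conformance check. The sketch below stands in for a JSON or SHACL validator and uses two invented profiles, one loosely imagined from a cataloguer's viewpoint and one from a genomics viewpoint. The profile contents and property names are hypothetical.

```python
# Hypothetical community profiles: the property lists are invented for illustration.
LIBRARY_PROFILE = {
    "minimum": ["name", "identifier", "license"],
    "recommended": ["description", "keywords"],
}
GENOMICS_PROFILE = {
    "minimum": ["name", "identifier", "organism", "assembly"],
    "recommended": ["license"],
}


def check_against_profile(record, profile):
    """Report which minimum and recommended properties a record lacks."""

    def missing(props):
        return [p for p in props if record.get(p) in (None, "", [])]

    missing_minimum = missing(profile["minimum"])
    return {
        "complies": not missing_minimum,
        "missing_minimum": missing_minimum,
        "missing_recommended": missing(profile["recommended"]),
    }


record = {
    "name": "Example dataset",
    "identifier": "https://example.org/datasets/1",
    "license": "https://creativecommons.org/licenses/by/4.0/",
}
library_view = check_against_profile(record, LIBRARY_PROFILE)
genomics_view = check_against_profile(record, GENOMICS_PROFILE)
print(library_view)
print(genomics_view)
```

The same record complies with one profile and fails the other, so a single "compliant with a community standard" score says little without naming the community.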
41. • Community governed “indicators” not metrics
• Automated objective scale up & out
• Sanity check put into practice
https://fairsharing.github.io/FAIR-Evaluator-FrontEnd/#!/
Community
creates
Maturity
Indicators
Registered,
Collections
Compliance
tests written,
registered
Resource
tested
from a
starting
identifier
Report,
(Registered)
Wilkinson et al “Evaluating FAIR Maturity Through a Scalable, Automated, Community-
Governed Framework” bioRxiv, https://doi.org/10.1101/649202 , 2019
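The flow on this slide (indicators registered, compliance tests written, a resource tested from a starting identifier, a report produced) can be sketched as follows. This is not the FAIR Evaluator's code or API: the registry, the two checks and the in-memory `FAKE_WEB` resolver are all assumptions made so the example runs without network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Result:
    indicator: str
    passed: bool
    comment: str  # narrative feedback, not just a score


Check = Callable[[dict], Tuple[bool, str]]
REGISTRY: Dict[str, Check] = {}


def register(indicator: str):
    """Register a compliance test under the indicator it implements."""

    def wrap(func: Check) -> Check:
        REGISTRY[indicator] = func
        return func

    return wrap


@register("metadata has a web identifier")
def check_identifier(meta):
    ident = str(meta.get("identifier", ""))
    ok = ident.startswith(("http://", "https://"))
    return ok, ("found " + ident) if ok else "no http(s) identifier in metadata"


@register("metadata states a licence")
def check_licence(meta):
    ok = bool(meta.get("license"))
    return ok, ("licence: " + str(meta["license"])) if ok else "no licence statement found"


def evaluate(identifier: str, resolve, collection: List[str]) -> List[Result]:
    """Resolve the starting identifier, then run each indicator in the collection."""
    meta = resolve(identifier)
    return [Result(name, *REGISTRY[name](meta)) for name in collection]


# Hypothetical stand-in for resolving an identifier to its metadata.
FAKE_WEB = {
    "https://example.org/datasets/1": {
        "identifier": "https://example.org/datasets/1",
        "name": "Example dataset",
    }
}


def resolve(identifier):
    return FAKE_WEB.get(identifier, {})


report = evaluate("https://example.org/datasets/1", resolve, list(REGISTRY))
for result in report:
    print(result)
```

Keeping the indicator list separate from the tests is what lets a community choose its own collection while reusing shared, automated checks.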
42. “FAIRification” (of legacy datasets)
the new magic wand word
• Need to do at the same
time as define indicators
• Needs experts
• BYODs
• ROI cost/benefit step
• Muddle with
harmonisation pipelines
(compliance to I and R)
• Non-trivial
• Upstream
• Turning into a business
https://fairplus-project.eu/
https://www.go-fair.org/fair-principles/fairification-process/
43. FAIR needs
to be at the
“first mile”,
embedded into
investigator
practice.
Mark Wilkinson
Just saying you are
FAIR doesn’t make
it true. It's uneven
and multi-faceted.
Identifier use is
chaotic.
Separating
metadata and data
is problematic.
FAIRification is
non-trivial.
FAIR is a set of
behaviours
not a specific
technology
44. Commons for autonomous,
self-managing Sys Bio projects
Hubs for Projects,
People, Data, Models, SOPs,
Workflows, Samples
First Mile /
Last Mile
From the
infrastructure /
standard /
commons /
database / tool / *
To the actual
investigator
fair-dom.org, fairdomhub.org
46. Respect and bridge the ecosystem
federated catalogue, integrated context
Public database
Local store
National infrastructure
Secure store
Public model
repository
Github
Shared SOPs
47. Neylon, Knowledge Exchange Report: http://www.knowledge-exchange.info/event/ke-approach-open-scholarship
Respect and bridge the ecosystem
going the first mile, and the last mile*
A miracle of sweat
and tears here
different scales, different agendas, different incentives
Koureas, The ‘last mile’ challenge for European research e-infrastructures https://riojournal.com/article/9933/
New ELIXIR
Converge
project
49. The Tragedy of the FAIR Commons*
• A Commons is only as
FAIR as its tenants
• Project sovereignty
• Public good vs personal
burden
• Professional
Stewardship for Projects
• Community socialisation
and values
Nudging
*Mark Musen, https://ncip.nci.nih.gov/blog/face-new-tragedy-commons-remedy-better-metadata/
Based on Matt Spritzer / Brian Nosek figure, COS
50. More than just data
Software, models, workflows, SOPs, Lab Protocols….
4th (and Last story): FAIR Digital Objects
52. FAIR Computational Workflows
The point of FAIR (meta)data was
to be machine actionable….. and
even better if machine generated.
• Operate in FAIR not proprietary
formats
• Support propagation of identifiers,
licenses, and AAI
• Mint FAIR identifiers, track data
provenance, license end products
Goble et al 2019 FAIR Computational Workflows https://doi.org/10.5281/zenodo.3268653
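The three bullets above can be sketched as a wrapper around a single workflow step. This is a toy illustration under stated assumptions, not part of any workflow system: identifiers are minted as UUID URNs, the licence is propagated only when all inputs agree, and the provenance record borrows the words "used" and "generated" informally.

```python
import uuid
from datetime import datetime, timezone


def run_step(name, func, inputs):
    """Run one step; return (output_record, provenance_record).

    Each input is a dict with "id", "value" and "licence" keys.
    """
    value = func(*[item["value"] for item in inputs])
    licences = {item["licence"] for item in inputs}
    output = {
        "id": "urn:uuid:" + str(uuid.uuid4()),  # minted identifier (assumed scheme)
        "value": value,
        # Propagate the licence only when it is unambiguous; otherwise leave
        # the decision to a person.
        "licence": licences.pop() if len(licences) == 1 else None,
    }
    provenance = {
        "activity": name,
        "used": [item["id"] for item in inputs],
        "generated": output["id"],
        "ended_at": datetime.now(timezone.utc).isoformat(),
    }
    return output, provenance


# Hypothetical input record.
raw = {
    "id": "https://example.org/data/raw-1",
    "value": [3, 1, 2],
    "licence": "https://creativecommons.org/licenses/by/4.0/",
}
sorted_out, prov = run_step("sort", sorted, [raw])
print(sorted_out)
print(prov)
```

Because the step itself emits the identifier, licence and provenance, the metadata is machine generated rather than added by hand afterwards.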
53. FAIR workflows in their own right.
Like Software:
Principles stretched
Versioning
Software maturity, quality, maintainability,
documentation practices
Goble et al 2019 FAIR Computational Workflows https://doi.org/10.5281/zenodo.3268653
54. FAIR workflows in their own right.
Like Data:
We can give them machine
actionable metadata.
Goble et al 2019 FAIR Computational Workflows https://doi.org/10.5281/zenodo.3268653
Describes workflows to be
portable, scalable & interoperable
with different workflow systems and containerised tools
Bundles descriptions, references, files
Adds context, provenance, examples, data …
Relates to data collections, SOPs, lab protocols…
Links CWL descriptions with native workflows
55. Regulatory Practice
robust, safe exchange and reuse of HTS
computational analytical workflows
http://biocomputeobject.org
IEEE P2791
BioCompute Working Group
[Vahan Simonyan]
Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al “Enabling Precision Medicine via standard communication of NGS provenance, analysis,
and results” PLOS Biology 2018, https://doi.org/10.1371/journal.pbio.3000099
56. A happy ending?
• FAIR is work in progress!
• Keep grounded,
developer friendly and
community supported
• No-one reads specs.
Everyone copies
examples.
• Nipype CWL is coming!
MG-RAST/EBI MGnify
Design by workflow blocks
Pipeline versions comparison
Pipeline exchange
Recycling tool descriptions and
sub-workflows
57. What is FAIR, what should be FAIR and how to
implement it is not simple.
It's not just Good Intentions
A social story, not a technical one.
Without incentives, cultural normalisation and long
term investment it will be just a story.
INCF’s FAIR Journey….
58. Acknowledgements
Ian Fore
Mark Wilkinson
Susanna Sansone
Stian Soiland-Reyes
Rob Grossman
Barend Mons
Nick Juty
Alasdair Gray
Rafael Jimenez
Michel Dumontier
Michael Crusoe
Ian Cottam
And all the projects and many more
Editor's Notes
FAIR was on the opening slides of the meeting
Maryann Martone is an author along with me
“Cyberinfrastructure that collocates data, storage, and computing infrastructure with commonly used tools for analyzing and sharing data to create an interoperable resource for the research community.”
(Open Commons Consortium)
“An environment where participants make use of computing and communication technologies to access shared instruments and data, as well as to communicate with others” (Wikipedia)
a database organizes data for a project; a data warehouse organizes data for an organization; and a data commons organizes data for a field or discipline. (Bob Grossman)
https://www.humanbrainproject.eu/en/explore-the-brain/
And the HBP Collaboratory
Incrementing
Interop – services, standards, know-how
Stuff is massive legacy
No one governance
13 RIs
Almost like a Meta-Commons
91 properties for dataset
Bioschemas Dataset, compliant with the Google Dataset profile
5 minimal properties
8 recommended properties
Link to DataCatalog
Link to DataDownload
Bioschemas markup added to MarRef pages
Markup crawled using BuzzBang
Data included as a BioSample Curation
Depicted by the External Links
Villains and Heroes
It is context dependent - fair for a library, not for plant sciences.
Though it all helps!
Links to other metadata help, though they may not be harmonised
Its about identifying and describing stuff.
Subset of the FAIR principles
BIDS, NeuroML and PyNN are endorsed (https://www.incf.org/resources/incf-endorsed-standards-best-practices)
https://www.incf.org/resources/other-standards-best-practices
Beware…
beauty is in the eye of the beholder
What’s FAIR from a cataloguer's perspective may be useless from a biologist's viewpoint
50 shades of FAIR – Robert-John Schmidt
FAIRsFAIR Open Consultation on FAIR Data Policies and Practices in Europe
Bioschemas mark-up about licence?
This group really tried this
Scale up and scale out automation of indicators and their evaluation
Mark volunteers to write compliance tests
Cookbooks, BYODs, Tools
A miracle occurs with very clever people
Running at the same time as defining FAIR
“50 Shades of FAIR”
Identifier use is chaotic, for both data and metadata.
Separating metadata and data is problematic
FAIR is a set of behaviours
not a specific technology
Content negotiation is NOT how you differentiate data from Metadata. It's how you negotiate serialization of the identified thing.
Identifier use is chaotic, for both data and metadata, and no clear way to point from one to another.
Separating what is metadata and what is data given a URI is a problem
FAIR is a set of behaviours (use of tech and people)
not a specific technology
Born FAIR
Hence stuff like ReproNim need
Community engagement: The ‘last mile’ challenge for European research e-infrastructures
Dimitrios Koureas,
HIDDEN SLIDE
From the opening talk
HIDDEN SLIDE
Villains mentioned: PIs and senior faculty
Heroes: PhD students
Join in!
Like Data: many FAIR Data Principles apply
Repositories (F)
Standardising descriptions of workflow, provenance and components (I, R): CWL, PROV
Metadata about, combining and referencing between components (I, R): Research Objects
HIDDEN SLIDE
The EOSC Life computational workflows stack
Standardize exchange of HTS workflows for regulatory submissions between FDA, pharma, bioinformatics platform providers and researchers
Inspect and replicate the computational analytical workflow to review and approve the bioinformatics