Findable Accessible Interoperable Reusable < data | models | SOPs | samples | articles | * >. FAIR is a mantra; a meme; a myth; a mystery; a moan. For the past 15 years I have been working on FAIR in a range of Life Science projects and initiatives. Some are top-down, like the Life Science European Research Infrastructures ELIXIR and ISBE, and some are bottom-up, supporting research projects in Systems and Synthetic Biology (FAIRDOM), Biodiversity (BioVeL), and Pharmacology (Open PHACTS), for example. Some have become movements, like Bioschemas, the Common Workflow Language and Research Objects. Others focus on cross-cutting approaches in reproducibility, computational workflows, metadata representation and scholarly sharing & publication. In this talk I will relate a series of FAIRy tales. Some of them are Grimm. Some have happy endings. Who are the villains and who are the heroes? What are the morals we can draw from these stories?
1. FAIRy stories
for Christmas
Carole Goble
The University of Manchester, UK
carole.goble@manchester.ac.uk
ELIXIR-UK, FAIRDOM, ISBE,
BioExcel CoE, Software Sustainability Institute
Open PHACTS
SWAT4HCLS 2017, 5th Dec 2017, Rome
2. Once upon a time in
a land far, far away
lived a King …
Who wanted all data
to be FAIR….
3.
4. Mark D. Wilkinson,
Michel Dumontier,
IJsbrand Jan Aalbersberg,
Gabrielle Appleton,
Myles Axton,
Arie Baak,
Niklas Blomberg,
Jan-Willem Boiten,
Luiz Bonino da Silva Santos,
Philip E. Bourne,
Jildau Bouwman,
Anthony J. Brookes,
Tim Clark,
Mercè Crosas,
Ingrid Dillo,
Olivier Dumon,
Scott Edmunds,
Chris T. Evelo,
Richard Finkers,
Alejandra Gonzalez-Beltran,
Alasdair J.G. Gray,
Paul Groth,
Carole Goble,
Jeffrey S. Grethe,
Jaap Heringa,
Peter A.C. ’t Hoen,
Rob Hooft,
Tobias Kuhn,
Ruben Kok,
Joost Kok,
Scott J. Lusher,
Maryann E. Martone,
Albert Mons,
Abel L. Packer,
Bengt Persson,
Philippe Rocca-Serra,
Marco Roos,
Rene van Schaik,
Susanna-Assunta Sansone,
Erik Schultes,
Thierry Sengstag,
Ted Slater,
George Strawn,
Morris A. Swertz,
Mark Thompson,
Johan van der Lei,
Erik van Mulligen,
Jan Velterop,
Andra Waagmeester,
Peter Wittenburg,
Katherine Wolstencroft,
Jun Zhao,
Barend Mons
Wilkinson Dumontier Schultes
Scientific Data 3, 160018 (2016)
doi:10.1038/sdata.2016.18
9. Stakeholder FAIR Awareness
UK Institutional Research Data Management guidance*
* Jisc: Final Report FAIR in Practice, Nov 2017
Government,
Funder,
Publisher,
National &
International
Infrastructures…
Institutional
Researchers
FAIR spread across the lands …… BUT not
necessarily all the peoples
15. Beware…
beauty is in the
eye of the
beholder
What’s FAIR from a cataloguer’s
perspective may be useless from
a biologist’s viewpoint
16. My Semantic FAIRy Stories
The Scientist and
the FAIR Commons
The MAGIC
Research Object
little semantics and
the big Web
17. The Scientists and the
FAIR Research
Commons
Supporting mixed
types and many
researchers
FAIR
18. The Scientists and the
FAIR Research
Commons
Find:
ID resolution
Faceted Navigation
Search, RDF
SPARQL endpoint, APIs
A Commons for Workflows
myexperiment.org
A Commons for Systems Biology Projects
fairdomhub.org
investigation
study
assay/analysis
data
models
SOPs
19. Community & Project Commons
Structured
organisation
across standards
and types
Federation over
autonomous
resources
Laissez-Faire
Independent
Users
Ecosystem of
types, stores
and metadata
20. Own little houses: from straw to bricks
Permission controls
Staged sharing
Licenses
Negotiated access
Embargos
Open
22. Getting the best FAIR metadata….
FAIR Access
– myExperiment -> open
– FAIRDOM -> friends and family
– Hand over straw houses to FAIRDOMHub
“The Tragedy of the Commons”*
– Metadata quality and quantity
– Identifier hygiene
– Curation & contributions
– Public good vs personal burden
– Incorporation into processes
– Community socialisation - obligations mismatches. Credit!
*Mark Musen, https://ncip.nci.nih.gov/blog/face-new-tragedy-commons-remedy-better-metadata/
27. “The Last Mile”* -> The First Mile
FAIR from bench to cloud
Last mile - Infrastructure
view
First mile - researcher /
resource view
* Dimitrios Koureas et al Community engagement: The ‘last mile’ challenge for
European research e-infrastructures
Research Ideas and Outcomes 2: e9933 (20 Jul 2016)
https://doi.org/10.3897/rio.2.e9933
29. The MAGIC Research
OBJECT
GENERIC Framework
For exchange,
reproducibility,
Preservation, active
artefacts
Universal Catering,
bottomless content
FAIR
30. The FAIR Research Object
import, exchange, portability, maintenance
ISA-TAB
Bergman et al COMBINE archive and OMEX format: one file to share all information to reproduce a modeling project,
BMC Bioinformatics 2014, 15:369
31. workflow engine
Workflow Run
Provenance
Inputs Outputs
Intermediates
Parameters
Configs
Narrative
Exchange between people & platforms
Commons store, catalogue & archive
Reproduce preserve, port, repair
Activate re-compute, mix, compare,
evolve
The FAIR Workflow Research Object
32. researchobject.org
Bechhofer et al (2013) Why linked data is not enough for scientists https://doi.org/10.1016/j.future.2011.08.004
Bechhofer et al (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge, https://eprints.soton.ac.uk/268555/
Standards-based generic
metadata framework for
bundling internal and external
resources with context
citable reproducible packaging
Data used and results produced in study
Methods employed to produce/analyse data
Provenance and settings for the experiments
People involved in the investigation
Annotations about these resources:-
understanding & interpretation
33. Linking across ROs and into the
Linked Open Data Cloud
• Recording & linking together the
components of an experiment
• Linking across experiments.
• Linked ROs
• A Semantic Web of Research
Objects
• Resource References – a
bottomless pot
34. Technology Independent.
The least possible.
The simplest feasible. Low tech.
Low user overhead and thin client
Graceful degradation.
FAIR ROs Desiderata
35. Construction Content Profile
Types
Identification
to locate things
Aggregates
to link things together
Annotations
about things & their
relationships
Type Checklists
what should be there
Provenance
where it came from
Versioning
its evolution
Dependencies
what else is needed
Manifest checklist
Type Checklists
describing what
should be there
Container
Metadata
Objects
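The construction elements on this slide (identification, aggregation, annotation, provenance, versioning, dependencies) can be illustrated with a minimal sketch. The class names, field names and example URIs below are hypothetical and chosen only for illustration; they are not the Research Object model vocabulary or any researchobject.org tool.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """One aggregated item; `uri` both identifies and locates it."""
    uri: str
    rtype: str                                          # hypothetical type label
    derived_from: list = field(default_factory=list)    # provenance
    requires: list = field(default_factory=list)        # dependencies


@dataclass
class ResearchObject:
    """A container that links resources together and carries notes about them."""
    identifier: str
    version: str
    aggregates: list = field(default_factory=list)
    annotations: list = field(default_factory=list)     # (target_uri, text) pairs

    def add(self, resource):
        self.aggregates.append(resource)

    def annotate(self, target_uri, text):
        self.annotations.append((target_uri, text))

    def dangling_references(self):
        """URIs named in provenance or dependencies that are not aggregated."""
        known = {r.uri for r in self.aggregates}
        refs = set()
        for r in self.aggregates:
            refs.update(r.derived_from)
            refs.update(r.requires)
        return sorted(refs - known)


# Hypothetical example content.
ro = ResearchObject(identifier="https://example.org/ro/1", version="1.0")
ro.add(Resource("data/input.csv", "data"))
ro.add(Resource("workflow/analysis.cwl", "workflow", requires=["tools/aligner"]))
ro.add(Resource("data/result.csv", "data",
                derived_from=["data/input.csv", "workflow/analysis.cwl"]))
ro.annotate("data/result.csv", "Output of the analysis run")
print(ro.dangling_references())
```

Here the dependency `tools/aligner` is referenced but not bundled, which is the kind of gap a manifest checklist is meant to surface.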
38. Profile
http://purl.org/minim/description
W3C
Shape Specs
*Gamble, Zhao, Klyne, Goble. "MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked
Data", IEEE eScience 2012, Chicago, USA, October 2012, http://dx.doi.org/10.1109/eScience.2012.6404489
validators / viewers
Minim model for
defining
checklists*
multiple profiles for
different consumers
Generic
Specifics
RO-SHOW
Container
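The idea of "multiple profiles for different consumers" can be sketched as one bundle checked against several checklists. This is only an illustration: the profile names, requirement levels and type labels are assumptions, and the code does not use the Minim vocabulary or any existing validator.

```python
# Hypothetical profiles: each maps a requirement level to the
# resource types that a particular consumer expects to find.
PROFILES = {
    "archivist": {"must": {"data", "licence"}, "should": {"provenance"}},
    "re-runner": {"must": {"data", "workflow", "parameters"},
                  "should": {"provenance"}},
}


def check(resource_types, profile):
    """Return the missing resource types per requirement level."""
    present = set(resource_types)
    return {level: sorted(expected - present)
            for level, expected in profile.items()}


# The same bundle is judged differently by each consumer.
bundle = ["data", "workflow", "provenance"]
for name, profile in PROFILES.items():
    print(name, check(bundle, profile))
```

The same container passes or fails depending on who is asking, which is why the checklist lives in a profile rather than in the generic container.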
39. Linked Data
Pharmacological
Discovery Platform
Data Releases
Dataset “build”
RO Library
Earth Sciences
Public Health Learning Systems
Asthma Research e-
Lab sharing and
computing statistical
cohort studies
Happy Endings!
ISA based Packaging,
Systems Biology commons
& publishing
Managing distributed
unmovable large datasets
for Biomedical HTS
analytic pipelines *
* Chard et al I'll take that to go: Big data bags and minimal identifiers for exchange of large, complex datasets,
https://doi.org/10.1109/BigData.2016.7840618
40. Happy Ending – Workflows
Biomedical HTS analytic pipelines
Manifest description of
CWL workflows + rich
context + provenance +
other objects + snapshots
Precision medicine
NGS pipelines regulation*
*Alterovitz, Dean II, Goble, Crusoe, Soiland-Reyes et al Enabling Precision Medicine via standard communication of NGS provenance, analysis, and results, biorxiv.org,
2017, https://doi.org/10.1101/191783
EDAM
Biomolecular modelling
Portable Workflows
42. Morals
Incremental, open frameworks hard work
– Extensive reuse of standards is tricky
– Too Generic vs Too Specific
– Multi-element type & nesting challenges
– ROs with a Purpose
– Examples & templates
Representational Beauty vs Tools
– Easy to make, hard to consume
– Be specific, be developer friendly
– Profiles & tools critical
Patience is a virtue
44. Structured data markup for web pages
Schema.org adds simple
structured metadata markup to
web pages & sitemaps for
harvesting, search and summary
snippet making.
Search engines often highlight
websites containing Schema.org
Widespread commercial and
open source infrastructure
creates a low barrier to adoption
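As a rough illustration of what such markup looks like, the sketch below builds a schema.org `Dataset` description as JSON-LD and wraps it in a script tag for embedding in a web page. The six properties used are real schema.org properties, but choosing these six is an assumption made for the example, not a Bioschemas profile, and all values are placeholders.

```python
import json


def dataset_jsonld(name, description, url, identifier, keywords, license_url):
    """Build a schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
        "keywords": keywords,
        "license": license_url,
    }


def as_script_tag(doc):
    """Wrap the JSON-LD so it can be embedded in a page's HTML."""
    return ('<script type="application/ld+json">\n'
            + json.dumps(doc, indent=2)
            + "\n</script>")


# Placeholder values for illustration only.
markup = dataset_jsonld(
    name="Example expression dataset",
    description="Placeholder description of an example dataset.",
    url="https://example.org/datasets/42",
    identifier="https://example.org/datasets/42",
    keywords=["example", "expression"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
)
print(as_script_tag(markup))
```

A harvester only needs to fetch the page and parse the script tag, which is what makes publishing without APIs or special feeds possible.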
45. Goldilocks & the 3 Use Cases
Standardised
metadata
mark-up
Metadata
published &
harvested
without APIs
or special
feeds
3 Use Cases
1. Finding/Citing,
2. Summary snippets
3. Metadata exchange /
ingest
Goldilocks
• Reuse ubiquitous
commercial platform
• The least possible change,
the max possible reuse
• Minimum properties – 6
• Reuse domain ontologies –
we are not reinventing
them!
Commodity
Off the Shelf tools
App eco-system
Repository Level
Content type level
48. schema.org tailored to the Biosciences
simple structured metadata markup on web pages & sitemaps
• Specific for life sciences
• Extends existing Schema.org types
• Focused on few types and well defined relationships
• Minimum properties for finding and accessing data
• Best practices for selected properties
• Managed by Bioschemas.org
• Generic data model
• Generous list of properties to describe data types
• Managed by Schema.org
49. Tailored schema.org to improve
Findability and Accessibility in Bioscience
Layer of constraints +
documentation + extensions
Leyla Garcia. Poster & Flashtalk
50. 2-3 Oct 2017, Hinxton, ~50 people
Ideally 6 concepts
Reuse ontologies
schema.org
Real mark-up
Tools
Find, Cite, Snippets,
Metadata exchange
Community
52. MORALs
Community Buy-in Worth it
• First specs & main mechanism for training
• Google / Schema & ELIXIR support
• Research Schemas for European Open
Science Cloud pilot
Goldilocks works but is hard work
• Types & Profiles debates
• Elegance vs best for tools
• Reuse domain ontologies
• Validation, mark-up & harvesting tools
Trolls
53. How are we FAIRing?
Different levels with different emphasis
It’s an Ecosystem, not a single solution
• Catalogues, Search, Stores
• Metadata Standards
• Standard Access protocols
• Identifiers, Policies
• Authorised Access
• Licensing
54. smart rebrand launch
Still hard, same stuff
Rally big communities
and grassroots initiatives
Examine our capabilities
There is no magic
56. Platform & user buy-in from the get-go
Passionate, dedicated leadership
Seeding critical mass
Community
Tools Driver
Bottom up initiatives fostered by big
umbrellas infrastructures
FAIR Semantic Village*
Simple & Lightweight
Ramps not revolutions
FAIR with a PURPOSE & With PEOPLE
FAIR
Support typical developer –
Familiarity – JSON, APIs
*Deb McGuinness
57. Research for FAIR
FAIR representation
• The Semantic Web
Automated metadata
• Deep learning, machine learning, AI
• Text Mining, Ontology mapping
Social metadata
• User Experience, Crowd Sourcing
• Choice architecture
FAIR action
• Blockchain
• Virtualised & remote execution
• Image processing
• Preservation & portability
• Provenance tracking, object trajectories
• Engineering & Design, Ethics, Social Sciences
Research +
Developer Practitioner
practices
58. Mark Robinson
Norman Morrison
Paul Groth
Tim Clark
Alejandra Gonzalez-Beltran
Philippe Rocca-Serra
Ian Cottam
Susanna Sansone
Kristian Garza
Daniel Garijo
Catarina Martins
Iain Buchan
Caroline Jay
David De Roure
Oscar Corcho
Steve Pettifer
Khalid Belhajjame
Jun Zhao
Phil Crouch
Lilian Gorea
Oluwatomide Fasugba
Stian Soiland-Reyes
Michael Crusoe
Rafael Jimenez
Alasdair Gray
Barend Mons
Sean Bechhofer
Michel Dumontier
Mark Wilkinson
Leyla Garcia
Stuart Owen
Katy Wolstencroft
Finn Bacall
Alan Williams
Wolfgang Mueller
Olga Krebs
Jacky Snoep
Matthew Gamble
Raul Palma
Mark Musen
http://www.researchobject.org
http://www.myexperiment.org
http://wf4ever.org
http://www.fair-dom.org
http://www.fairdomhub.org
http://seek4science.org
http://rightfield.org.uk
http://www.bioschemas.org
http://www.commonwl.org
http://www.bioexcel.eu
http://www.openphacts.org
Editor's Notes
The additions are hidden behind these … just as important and not the same….
Many Princes. Scientific Data 3, Article number: 160018 (2016), doi:10.1038/sdata.2016.18
https://www.nature.com/articles/sdata201618
ELIXIR, RDA
Child as first payment
Be careful what you promise
Slide from NLM CLA
RIN?
CERIF, CLARIN
Me too! The elephant & the blind men
Who are the witches and the godmothers?
What’s the get-out clause?
Three – open PHACTS?
What did we learn – much harder than you think.
Windsor….what did we learn?
Distributed commons
Dig out user numbers
Cliques and complementarity
Visibility is muted.
Licensing…
PI leadership
Sticking to conventions
Local responsibility
Time and resource
Curation recognition
Trust
Tribal trading behaviours
Enclave sharing
Not public donation
Reciprocity & credit
Drivers …
External dominate
Personal productivity
Stratified to hide the visible from the invisible.
We also have APIs, RAILS
Consumer – producer obligations mismatches
Wolves: Project PIs, funders, time
Godmothers: Project PIs, “PALs”, templates, funders
Deferred pain
The ant and the grasshopper
Automate or sneak
From the IB 13 talk and the Group 09 talk
Active enclave sharing
Public sharing tricky even after publication, bribery and threats
Data Hugging, Flirting and Voyeurism
Playground rules apply
Fluid, transient collaborations > membership mgt pain in a*se
Shameless exploitation of PI competitiveness & vanity
PI & Funder leadership
Pan project spawned collaborations – YES!!!!
But not necessarily visible to us.
PALs are also the Cinderellas
The scientists’ world does not revolve around your infrastructure or agenda.
Bullying doesn’t work
Fame / Shame
Money / Burden
Love / Fear
Side effect / special effort
Templates! Spreadsheets
spreadsheets are your friend, not Cinderellas
Similarly on myexperiment – metadata in CWL can be extracted…
Choice
Don’t necessarily interleave
Across platforms
Bechhofer, Sean, De Roure, David, Gamble, Matthew, Goble, Carole and Buchan, Iain (2010) Research Objects: Towards Exchange and Reuse of Digital Knowledge At The Future of the Web for Collaborative Science (FWCS 2010), United States.
Bechhofer, Sean, Buchan, Iain, De Roure, David, Missier, Paolo, Ainsworth, John, Bhagat, Jiten, Couch, Philip, Cruickshank, Don, Delderfield, Mark, Dunlop, Ian, Gamble, Matthew, Michaelides, Danius, Owen, Stuart, Newman, David, Sufi, Shoaib and Goble, Carole (2013) Why linked data is not enough for scientists. Future Generation Computer Systems 29(2), 599-611. North-Holland, published 28 Feb 2013.
Recording & linking together the components of an experiment
Linking across experiments.
Linked ROs
Bigger on the inside than the outside
Predated the FAIR Principles
Element enumeration
Identification & citation
Description tracking attributes (metadata) and origins (provenance) of contents.
Simplicity - low user overhead and thin (no) client
RO-bagit
Generic tools
multiple bespoke profiles – RDA Data Provenance approach. One for CERIF, one for DataCite
Typing
HIDDEN SLIDE
Specific to the generic
HIDDEN SLIDE
Keeping the context of data content together when it’s scattered; transferring and archiving very large HTS datasets in a location-independent way
These tools combine a simple and robust method for describing data collections (BDBags), data descriptions (Research Objects), and simple persistent identifiers (Minids) to create a powerful ecosystem of tools and services for big data analysis and sharing. We present these tools and use biomedical case studies to illustrate their use for the rapid assembly, sharing, and analysis of large datasets.
SEAD – Jim Myers
Too vague and too general – needed profile lock-down
Can’t make profiles in the abstract
First specifications:
Bio data infrastructure
Data Catalog
Datasets
Bio data types
Human beacons
Samples
Plant Phenotypes
Proteins
(Chemistry)
Bio stuff
Training materials
Events
Laboratory protocols
Workflows and Tools
Of course this is relevant to ROs – dataset in particular is similar to collection. An RO is a structured collection.
Now the most popular mechanism for publishing and harvesting metadata, beating APIs and scraping.
HIDDEN SLIDE
Usecases
Biobanks should be able to crawl the BioSamples database to identify all the published (and searchable) datasets derived from samples they have provided
Public archives should be able to crawl Biobank websites, in order to identify samples that are known to have public accessions in the BioSamples database AND that can be made publicly available, and thereby link public samples to a provider (“where can I get more of this sample?”).
In case of privacy or consent considerations, only the biobank should know what are the specific samples connected to publicly available datasets
Public archives should be able to crawl Biobank websites, in order to identify ‘sanitised’ sample metadata descriptions (again, in case of confidentiality or consent considerations). Biobanks remain responsible for ensuring only authorised metadata is visible, and can control access to restricted samples.
Assumptions
Each sample provided by a biobank has an opaque pseudo-anonymous identifier that is assigned by the biobank to identify a specific sample (referred to hereafter as the “sample name”)
Each sample reported in a public archive or used to generate a public dataset has a public, BioSamples database accession (hereafter called “sample identifier”).
In some cases, a biobank may issue different sample identifiers when providing the same sample to different projects. This may result in duplicated sample accessions in the BioSamples database
Given these use cases and assumptions, we will use Bioschemas to describe sample links. The main challenge is therefore the identification of links between sample identifiers (within Biobanks) and sample accessions (from the BioSamples database). This is not always possible without considerable additional curation effort, but of the 5 million samples in the BioSamples database, over 4 million declare either a ‘synonym’, ‘sample source name’ or ‘source name’ attribute, frequently used to encode the original biobank sample name. Exposing these in a structured manner through the BioSamples database would allow Biobanks to crawl and analyse this content, marrying samples that are recognised with their own internal identifiers.
Once this mapping is done, Biobanks can then re-expose these links through structured content on their own websites, allowing public resources to reciprocate links from public records back to the sample provider.
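The matching step described above can be sketched as follows. The three attribute names come from the notes; the record layout, accession strings and biobank names are hypothetical and do not reflect the actual BioSamples API or data format.

```python
# Attributes that, per the notes, frequently encode the biobank sample name.
NAME_ATTRIBUTES = ("synonym", "sample source name", "source name")


def link_samples(biosamples_records, biobank_names):
    """Map each recognised biobank sample name to the accessions citing it."""
    known = set(biobank_names)
    links = {}
    for record in biosamples_records:
        attributes = record.get("attributes", {})
        for attribute in NAME_ATTRIBUTES:
            value = attributes.get(attribute)
            if value in known:
                links.setdefault(value, []).append(record["accession"])
                break  # one match per record is enough
    return links


# Hypothetical crawled records and biobank-internal names.
records = [
    {"accession": "ACC-0001", "attributes": {"synonym": "BB-17"}},
    {"accession": "ACC-0002", "attributes": {"source name": "BB-17"}},
    {"accession": "ACC-0003", "attributes": {"sample source name": "unknown"}},
]
print(link_samples(records, ["BB-17", "BB-18"]))
```

Returning a list per name keeps the duplicated accessions mentioned in the assumptions visible rather than silently overwriting them.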
Implementation Study Outline
Objectives
Facilitate the ingestion of sample metadata from data repositories (e.g. Biobank databases) into registries like BioSamples, the BBMRI Biobank directory or the UKCRC Tissue Directory via Bioschemas.
Engage and help data providers and developers of BioBank LIMS to test and adopt the exposure of sample metadata via Bioschemas
Contribute to contextualising information from data sample registries (e.g. BioSamples), biobank sample repositories (e.g. NL Biobank) and Biobank Registries (e.g. BBMRI Biobank directory)
Make registries like BioSamples compliant with Bioschemas.
Biobanks crawl BioSamples to discover sample accessions, markup etc if they have 'known' biobank name fields.
Sample (study) catalogues provide findability for the individual samples
- Aligning with MIABIS Sample Donor and Sample modules
Work with repositories/Biobanks/LIMS to adopt Bioschemas
• Develop general crawler: in collaboration with the Bioschemas community
F2Share (Federation framework for data Sharing): https://github.com/MIABIS/logstash-configuration-generator/wiki
More tools needed than thought!
14+ repositories marked up
HIDDEN SLIDE
Maintain common profiles across scientific domains focused on finding and accessing data
Minimum properties
General best practices
Support different scientific domains to extend and develop domain specific profiles
Evidence for the funders and researchers
Focused on the technical and social, but the economic and political are critical.
Ecosystem
Grassroots community activities
Fostered by Infrastructure Initiatives
Don’t squash the start up!
Open standards and lightweight
Practical engineering
Keeping it simple and real
Ramps rather than Revolution
Specialist, bespoke
Rise of containers
Too vague and too general – needed profile lock-down
Can’t make profiles in the abstract
Added afterwards….
Successes
Multiple apps developed
500+ users
20-30 million hits a month
Used to answer real pharmaceutical research questions
API documentation
Lessons
Support the typical app developer workflow (i.e. APIs, JSON)
Support domain specific (non-RDF) services
Identifier equivalence is non-trivial
Free text search is important
Staying up-to-date with dataset updates is a challenge