1) The CEDAR team at Stanford University is developing tools and standards to improve metadata for published datasets in order to make data more FAIR (Findable, Accessible, Interoperable, Reusable).
2) Current metadata are often incomplete, inconsistent, and not machine-readable, limiting how data can be searched and reused. CEDAR aims to address this through standardized templates, controlled vocabularies, and tools that make high-quality metadata easier for researchers to produce.
3) If experimental metadata can be standardized and disseminated in a machine-interpretable way using approaches like CEDAR, they will support new technologies for automated literature search, data integration, and the tracking of scientific advances, enabling more effective reuse of scientific data.
dkNET Webinar - FAIR Data Require Better Metadata: The Case for CEDAR 11/13/2020
FAIR Data Require Better Metadata: The Case for CEDAR
Mark A. Musen, M.D., Ph.D., and the CEDAR Team
Stanford University
musen@Stanford.EDU
A Story
Purvesh Khatri, Ph.D. A self-professed “data parasite”
Khatri has reused public data sets to identify genomic signatures …
• For incipient sepsis
• For active tuberculosis
• For distinguishing viral from bacterial respiratory infection
• For rejection of organ transplants
Now he’s working on diagnosis and prognosis of COVID-19
… and he has never touched a pipette!
But you can be a data parasite only if scientific data are …
• Available in an accessible repository
• Findable through some sort of search facility
• Self-describing so that third parties can make sense of the data
• Intended to be reused, and to outlive the experiment for which they were collected
The FAIR Principles
F1: (Meta)data are assigned globally unique and persistent identifiers
F2: Data are described with rich metadata
F3: Metadata clearly and explicitly include the identifier of the data they describe
F4: (Meta)data are registered or indexed in a searchable resource
A1: (Meta)data are retrievable by their identifier using a standardised communication protocol
A1.1: The protocol is open, free and universally implementable
A1.2: The protocol allows for an authentication and authorisation procedure where necessary
A2: Metadata should be accessible even when the data are no longer available
I1: (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation
I2: (Meta)data use vocabularies that follow the FAIR principles
I3: (Meta)data include qualified references to other (meta)data
R1: (Meta)data are richly described with a plurality of accurate and relevant attributes
R1.1: (Meta)data are released with a clear and accessible data usage license
R1.2: (Meta)data are associated with detailed provenance
R1.3: (Meta)data meet domain-relevant community standards
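As a concrete (and entirely hypothetical) illustration of what a few of these principles ask for, the sketch below writes a small JSON-LD-style metadata record as a Python dictionary; the DOIs, field names, and values are invented, not drawn from any real repository.

```python
import json

# A hypothetical JSON-LD-style metadata record sketched as a Python dict.
# It illustrates F1 (globally unique, persistent identifiers), F2 (rich
# metadata), F3 (the metadata name the dataset they describe), and R1.1
# (an explicit usage license).  All identifiers and values are invented.
metadata_record = {
    "@context": {"schema": "https://schema.org/"},
    "@id": "https://doi.org/10.0000/example-metadata-record",            # F1 (invented DOI)
    "@type": "schema:Dataset",
    "schema:about": {"@id": "https://doi.org/10.0000/example-dataset"},  # F3 (invented DOI)
    "schema:name": "RNA-seq of example tissue samples",
    "schema:description": "Gene-expression profiles collected under two conditions.",
    "schema:variableMeasured": ["age (years)", "sex", "disease"],        # F2: rich attributes
    "schema:license": "https://creativecommons.org/licenses/by/4.0/",    # R1.1
}

# Indexing this record in a searchable catalog (F4) and serving it over an
# open protocol such as HTTP (A1, A1.1) would be the repository's job.
print(json.dumps(metadata_record, indent=2))
```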
Our current approaches to FAIRness are failing
• Reviewing the FAIR principles as written
• Questionnaires and checklists
– Require detailed knowledge of the resource
– Can’t really call out dishonesty or wishful thinking
– Are tedious and time-consuming
• Automated FAIRness checkers
– Provide a report after it has become too late to do anything about the problems
– Don’t identify which factors the user can actually control
What investigators can control:
• It’s not the arcane stuff, such as
– Use of globally unique and persistent identifiers
– Metadata registration in a searchable index
– Retrieval of data by identifier using a standardized communication protocol
• It’s the concrete stuff they understand
– The metadata that describe the experiment
– The metadata that describe the experiment
– The metadata that describe the experiment
Metadata in most repositories are a mess!
• Investigators view their work as publishing papers, not leaving a legacy of reusable data
• Sponsors or management may require data sharing, but they do not encourage the use of research funds to pay for it
• Creating the metadata to describe data sets is unbearably hard
age
Age
AGE
`Age
age (after birth)
age (in years)
age (y)
age (year)
age (years)
Age (years)
Age (Years)
age (yr)
age (yr-old)
age (yrs)
Age (yrs)
age [y]
age [year]
age [years]
age in years
age of patient
Age of patient
age of subjects
age(years)
Age(years)
Age(yrs.)
Age, year
age, years
age, yrs
age.year
age_years
Failure to use standard terms often makes datasets impossible to search (a small sketch of the problem follows)
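A small, hypothetical sketch of why this matters for search: an exact-match lookup for an “age” attribute misses nearly all of the variant spellings above, while even a crude normalization heuristic (invented here purely for illustration; real curation maps field names to controlled terms) recovers them.

```python
import re

# A sample of the variant attribute names observed for the single concept "age".
variants = ["age", "Age", "AGE", "`Age", "age (in years)", "Age (Years)",
            "age [y]", "age of patient", "Age(yrs.)", "age, years", "age_years"]

def naive_match(field_name: str) -> bool:
    """Exact string comparison, as a simple search index might do."""
    return field_name == "age"

def normalized_match(field_name: str) -> bool:
    """Crude illustrative heuristic: lower-case, strip punctuation, and drop
    unit/qualifier words before comparing.  A real solution maps field names
    to a controlled term (e.g., an ontology identifier) instead."""
    cleaned = re.sub(r"[^a-z ]", " ", field_name.lower())
    stop_words = {"in", "of", "years", "year", "yrs", "yr", "y",
                  "patient", "subjects", "after", "birth", "old"}
    tokens = [t for t in cleaned.split() if t not in stop_words]
    return tokens == ["age"]

print(sum(naive_match(v) for v in variants), "of", len(variants), "found by exact match")
print(sum(normalized_match(v) for v in variants), "of", len(variants), "found after normalization")
```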
NCBI BioSample Metadata are Dreadful!
• 73% of “Boolean” metadata values are not actually true or false
– nonsmoker, former-smoker
• 26% of “integer” metadata values cannot be parsed into integers
– JM52, UVPgt59.4, pig
• 68% of metadata entries that are supposed to represent terms from biomedical ontologies do not actually do so
– presumed normal, wild_type
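A minimal sketch, assuming invented field names and values, of the kind of type checking that produces numbers like these; it is not the actual BioSample analysis pipeline.

```python
# Hypothetical checker: does a metadata value conform to the type its
# attribute claims?  Field names, expected types, and values are invented.
def is_boolean(value: str) -> bool:
    return value.strip().lower() in {"true", "false", "yes", "no"}

def is_integer(value: str) -> bool:
    try:
        int(value.strip())
        return True
    except ValueError:
        return False

expected_types = {"smoker": is_boolean, "age": is_integer}

records = [
    {"smoker": "nonsmoker", "age": "JM52"},     # neither value conforms
    {"smoker": "true", "age": "52"},            # both values conform
    {"smoker": "former-smoker", "age": "pig"},  # neither value conforms
]

for i, record in enumerate(records):
    for field, value in record.items():
        if not expected_types[field](value):
            print(f"record {i}: field '{field}' has non-conforming value '{value}'")
```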
We will never have FAIR data until …
• We have consensus regarding what is good metadata
• We have standards for implementing good metadata
– For the overall structure of the metadata
– For the values that we use
• We have tools that make it easy to create good metadata
Requirement #1: Standard ontologies to describe what exists in a dataset completely and consistently
The microarray community took the lead in standardizing biological metadata with the MIAME guidelines (Minimum Information About a Microarray Experiment)
DNA Microarray
– What was the substrate of the experiment?
– What array platform was used?
– What were the experimental conditions?
But it didn’t stop with MIAME!
• Minimal Information About T Cell Assays (MIATA)
• Minimal Information Required in the Annotation of biochemical Models (MIRIAM)
• MINImal MEtagenome Sequence analysis Standard (MINIMESS)
• Minimal Information Specification For In Situ Hybridization and Immunohistochemistry Experiments (MISFISHIE)
Minimal Information Guidelines are not Models
• MIAME and its kin specify only the “kinds of things” that investigators should include in their metadata
• They do not provide a detailed list of standard metadata elements
• They do not provide datatypes for valid metadata entries
• It takes work to convert a prose checklist into a computable model (a sketch of such a conversion follows)
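To make that last point concrete, here is a hypothetical conversion of a prose checklist item such as “report the organism and the array platform used” into a small computable field specification with datatypes and permitted values; the field names and value lists are invented and far simpler than a real CEDAR template element.

```python
# Hypothetical computable rendering of a prose checklist item: each field
# gets a datatype and, where possible, a controlled list of permitted values.
template = {
    "organism": {
        "datatype": "controlled_term",
        "permitted_values": ["Homo sapiens", "Mus musculus"],  # invented subset
        "required": True,
    },
    "array_platform": {
        "datatype": "string",
        "required": True,
    },
    "number_of_replicates": {
        "datatype": "integer",
        "required": False,
    },
}

def validate(record: dict, template: dict) -> list:
    """Return a list of problems found in `record` against `template`."""
    problems = []
    for field, spec in template.items():
        if field not in record:
            if spec["required"]:
                problems.append(f"missing required field '{field}'")
            continue
        value = record[field]
        if spec["datatype"] == "integer" and not str(value).isdigit():
            problems.append(f"'{field}' should be an integer, got '{value}'")
        if spec["datatype"] == "controlled_term" and value not in spec["permitted_values"]:
            problems.append(f"'{field}' value '{value}' is not a permitted term")
    return problems

print(validate({"organism": "human", "array_platform": "XYZ-1000"}, template))
# -> ["'organism' value 'human' is not a permitted term"]
```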
To make sense of metadata …
• We need computer-based standards
– For what we want to say about experimental data
– For the language that we use to say those things
– For how we structure the metadata
• We need tools that make it easy to apply these standards
• We need a scientific culture that values making experimental data accessible to others
Requirement #3: Make it palatable to describe experiments completely and consistently—which means we need good tools!
[Slide: given the value “brain”, suggested disease terms from the Disease Ontology (DOID) with associated percentages: Parkinson’s disease (39%), central nervous system lymphoma (27%), autistic disorder (22%), melanoma (5%), schizophrenia (1%), Edwards syndrome (2%)]
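A minimal, hypothetical sketch of how suggestions like these can be generated: rank candidate values for one field by how often they co-occur with the values already entered in other fields, using previously submitted metadata as the evidence. The records and counts below are invented; CEDAR’s actual value recommender is more sophisticated.

```python
from collections import Counter

# Invented historical metadata records (field -> value); in practice these
# would come from previously submitted, well-annotated samples.
history = [
    {"tissue": "brain", "disease": "Parkinson's disease"},
    {"tissue": "brain", "disease": "Parkinson's disease"},
    {"tissue": "brain", "disease": "central nervous system lymphoma"},
    {"tissue": "liver", "disease": "hepatocellular carcinoma"},
]

def suggest(field: str, context: dict, records: list) -> list:
    """Rank candidate values for `field` by how often they co-occur with the
    values the user has already filled in (`context`)."""
    matching = [r for r in records
                if all(r.get(k) == v for k, v in context.items())]
    counts = Counter(r[field] for r in matching if field in r)
    total = sum(counts.values()) or 1
    return [(value, round(100 * n / total)) for value, n in counts.most_common()]

print(suggest("disease", {"tissue": "brain"}, history))
# -> [("Parkinson's disease", 67), ('central nervous system lymphoma', 33)]
```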
Some key features of CEDAR
• All semantic components—template elements, templates, ontologies, and value sets—are managed as first-class entities
• User interfaces and drop-down menus are not hardcoded, but are generated on the fly from CEDAR’s semantic content
• All software components have well-defined APIs, facilitating reuse of software by a variety of clients
• CEDAR generates all metadata in JSON-LD, a widely adopted Web standard that can be translated into other representations (a toy example follows)
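A toy illustration of that last bullet, using an invented metadata instance rather than actual CEDAR output: because JSON-LD is ordinary JSON plus a semantic context, a record can be parsed with standard tooling and translated into another representation, here a spreadsheet-style CSV row.

```python
import csv, io, json

# Invented JSON-LD-style metadata instance (not actual CEDAR output).
instance_json = """
{
  "@context": {"schema": "https://schema.org/"},
  "@id": "https://example.org/metadata/sample-001",
  "schema:name": "Sample 001",
  "age (years)": "54",
  "disease": "Parkinson's disease"
}
"""

instance = json.loads(instance_json)

# Translate into another representation: one CSV row per instance,
# dropping the JSON-LD keywords that carry the semantic context.
fields = [k for k in instance if not k.startswith("@")]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerow({k: instance[k] for k in fields})
print(buffer.getvalue())
```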
Who’s Using CEDAR?
Online Data Will Never Be FAIR
• Until we standardize metadata structure using common templates
• Until we can fill in those templates with controlled terms whenever possible
• Until we create technology that will make it easy for investigators to annotate their datasets in standardized, searchable ways
• Until the scientific enterprise adopts this kind of technology
The “research parasites” of the world who reuse data for a living will be able to learn from datasets that are “cleaner,” better organized, and more searchable
Purvesh Khatri, Ph.D.
Scientists will learn when new results become available that are relevant to their work, and how those results relate to the data that they are already collecting
Clinicians will be able to find the evidence that best helps them to select treatments for their patients
Once comprehensive experimental metadata are disseminated in machine-interpretable form …
• Technology such as CEDAR will assist in the automated “publication” of scientific results online
• Intelligent agents will
– Search and “read” the “literature”
– Integrate information
– Track scientific advances
– Re-explore existing scientific datasets
– Suggest the next set of experiments to perform
– And maybe even do them!