CDISC2RDF poster for Conference on Data Integration in the Life Sciences 2013
CDISC2RDF
Making clinical data standards linkable, computable and queryable
The CDISC2RDF initiative exploits Semantic
Web standards and Linked Data principles for
clinical data standards from CDISC (Clinical
Data Interchange Standards Consortium).
Introduction
Clinical data standards have been identified as one of five
initial focus areas by TransCelerate BioPharma, the non-profit
organization formed by ten leading pharmaceutical companies
to accelerate the development of new medicines.
The European Medicines Agency (EMA) is developing a policy
on the proactive publication of clinical-trial data in the interests
of public health including clear and understandable clinical
data formats. The FDA has a long-held goal of making better
use of submitted clinical trial data. Pharmaceutical companies
have attempted to use submission standards to create study
repositories.
Exploiting Semantic Web technologies stands to simplify the
interpretation of individual studies, and improve cross-study
integration.
Kerstin Forsberg, Informatics Scientist
kerstin.l.forsberg@astrazeneca.com
Analysis, Informatics & Knowledge Engineering Practice, AstraZeneca, Sweden
CDISC2RDF Schemas
The first version of the core CDISC2RDF schemas was
intentionally developed to represent a minimal part of the
ISO11179 model for metadata registries.
The Meta Model Schema (mms) represents the core Data
Description part of the ISO11179 model, Part 3: Registry
metamodel and basic attributes.
From human readable documentation and “Text strings”
In the domain of clinical research, CDISC, a non-profit
organization, has developed standards for study design
(SDM), study data collection (CDASH), study data analysis
(ADaM), and submission to the regulatory bodies (SDTM).
These define a limited set of data elements with names
such as “RACE” that also have a value set derived from the
NCI Thesaurus. However, most of the data elements are
containers, either for contextual variables with names such as
“VSDATE” and “AEACN” (Date of measurement of Vital Signs and
Action Taken for Adverse Events) or for the results of
measurements. What was measured is only indirectly indicated,
in variables called “TESTCD”, by a term, or rather a text string,
such as “DIABP”, “BMI” or “HGB” representing the measurement
procedure. These terms are listed in the so-called controlled
terminologies (CT) for SDTM (Study Data Tabulation Model).
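To illustrate why a bare text string in “TESTCD” is harder to work with than a proper identifier, here is a minimal Python sketch. The rows, the column names other than TESTCD, and the code-list excerpt are hypothetical illustrations and are not taken from the CDISC2RDF files.

```python
# Hypothetical vital-signs rows: the measured concept is not a column of its
# own but a text string stored in the TESTCD variable.
rows = [
    {"USUBJID": "S1", "TESTCD": "DIABP", "ORRES": "80"},
    {"USUBJID": "S1", "TESTCD": "BMI", "ORRES": "24.1"},
    {"USUBJID": "S2", "TESTCD": "XYZ", "ORRES": "7"},
]

# Illustrative excerpt of a controlled terminology code list (code -> label).
TEST_CODES = {
    "DIABP": "Diastolic Blood Pressure",
    "BMI": "Body Mass Index",
    "HGB": "Hemoglobin",
}


def decode_test(row, code_list=TEST_CODES):
    """Return the label for the row's TESTCD, or None if the string is unknown."""
    return code_list.get(row["TESTCD"])


decoded = [decode_test(r) for r in rows]
# Strings that match nothing in the code list can only be caught by a lookup.
unknown = [r["TESTCD"] for r in rows if decode_test(r) is None]
```

Nothing in the data itself says what “DIABP” means or where it is defined; that knowledge lives in a separate document, which is the gap that assigning URIs is meant to close.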
Today all data standards and controlled terminologies are
published as PDFs, Excel files, and traditional XML by CDISC
and NCI EVS.
Human readable documentation in
PDFs, Excel files (and some in XML)
CDISC2RDF Schemas
(based on the core of ISO11179)
Machine processable linked
data structured as RDF triples
Meta model schema
(mms)
(Data definition, the core part of ISO 11179)
Controlled Terminology schema
(cts)
(a few additional properties
from the NCI Thesaurus export)
SDTM 1.2 schema
(sdtms)
(classifiers: Data Element roles and types)
SDTM 3.1.2 IG schema (sdtmigs)
(a few additional properties)
To machine processable RDF triples and “URIs”
The first deliverable from the CDISC2RDF project was
published in early 2013. It contained OWL/RDF files (triples) for
the CDISC submission standards: SDTM 1.2, Implementation
Guide (IG) 3.1.2 and Controlled Terminology (CT), plus
CTs for the data capture standard (CDASH) and the analysis
standard (ADaM).
Each data element (column), dataset, code list, classifier, etc.
has been assigned a URI (Uniform Resource Identifier), for
example http://rdf.cdisc.org/sdtmig-3-1-2/std#Column.AE.AEACN
for the AEACN column.
The SDTM schema (sdtms) version 1.2 defines additional
classifiers in the underlying model, such as the data
element role (for example, Record Qualifier) and whether a
variable is Expected.
The Controlled Terminology schema (cts) adds a few
classifications and properties to the meta model schema
(mms) in order to represent the existing NCI Thesaurus
EVS export.
The classes and properties are used to annotate the
Excel column headers. The standard import functionality
in the TopBraid Composer tool has then been used to create
the RDF triples in XML, Turtle, and JSON formats.
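The idea of annotating spreadsheet column headers with properties and turning each row into triples can be sketched in plain Python. This is not the TopBraid Composer import, only a simplified stand-in that emits N-Triples lines. The `http://example.org/mms#` namespace, the property names, the header names and the sample row are all hypothetical assumptions.

```python
import csv
import io

# Namespace taken from the example URIs on this poster.
STD = "http://rdf.cdisc.org/sdtmig-3-1-2/std#"
# Hypothetical namespace and property names, used for illustration only.
EX = "http://example.org/mms#"

# The "annotation": each spreadsheet column header is mapped to a property URI.
HEADER_TO_PROPERTY = {
    "Variable Name": EX + "dataElementName",
    "Variable Label": EX + "dataElementLabel",
}

# Hypothetical one-row excerpt standing in for an Excel sheet exported as CSV.
SHEET = """Domain,Variable Name,Variable Label
AE,AEACN,Action Taken with Study Treatment
"""


def escape(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_ntriples(sheet_text):
    """Emit one N-Triples line per annotated cell of each row."""
    lines = []
    for row in csv.DictReader(io.StringIO(sheet_text)):
        subject = f"<{STD}Column.{row['Domain']}.{row['Variable Name']}>"
        for header, prop in HEADER_TO_PROPERTY.items():
            lines.append(f'{subject} <{prop}> "{escape(row[header])}" .')
    return lines


triples = rows_to_ntriples(SHEET)
```

Each row becomes a subject URI and each annotated column becomes a predicate, so the same sheet that people read can also be loaded into a triple store.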
CDISC2RDF started as a cross-pharma pre-
competitive project with AstraZeneca, Roche,
TopQuadrant, Free University of Amsterdam
and W3C HCLS to showcase the use of
Semantic Web standards and Linked Data
principles.
It is now incorporated in the Semantic
Technology project, part of the FDA/PhUSE
working group on Emerging Technologies with
representatives from FDA, CDISC, pharmaceutical
companies, CROs and software vendors.
We want to push back to CDISC and NCI, and to other public and internal standards
groups, and show in practice how to “Use (semantic web) standards for standards”.
Example URIs:
http://rdf.cdisc.org/sdtmig-3-1-2/std#Column.AE.AEACN
http://rdf.cdisc.org/sdtmig-3-1-2/std#Table.AE
http://rdf.cdisc.org/sdtm-1-2/schema#Classifier.RecordQualifier
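The three URIs above appear to follow a simple naming pattern: a namespace per standard, then a kind (Table, Column, Classifier), then dot-separated names. A minimal Python sketch of that pattern follows. The function names are hypothetical, and the pattern is inferred from these three examples only, so it may not hold for every resource in the published files.

```python
# Namespaces copied from the example URIs on this poster.
SDTMIG_STD = "http://rdf.cdisc.org/sdtmig-3-1-2/std#"
SDTM_SCHEMA = "http://rdf.cdisc.org/sdtm-1-2/schema#"


def table_uri(domain):
    """URI for a dataset (table), e.g. the AE domain."""
    return f"{SDTMIG_STD}Table.{domain}"


def column_uri(domain, variable):
    """URI for a data element (column) within a dataset."""
    return f"{SDTMIG_STD}Column.{domain}.{variable}"


def classifier_uri(name):
    """URI for a classifier defined in the SDTM 1.2 schema."""
    return f"{SDTM_SCHEMA}Classifier.{name}"


ae_table = table_uri("AE")
aeacn_column = column_uri("AE", "AEACN")
record_qualifier = classifier_uri("RecordQualifier")
```

Because the identifier is built from the standard, the dataset and the variable name, the same column always resolves to the same URI across studies, which is the property that cross-study linking relies on.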
All OWL/RDF files, schemas and standards
are available on https://code.google.com/p/cdisc2rdf/