This document discusses semantic data normalization of clinical trial data to make it more structured and amenable to analysis. It describes converting unstructured clinical data like conditions, interventions, adverse events and eligibility criteria into RDF triples. The goal is to extract key phrases and concepts, identify qualifiers and relationships to formally represent the data. Examples show how condition texts, drug annotations and criteria can be modeled. Current work has normalized over 215,000 clinical studies from ClinicalTrials.gov into over 80 million RDF triples. The normalized data is pre-loaded in GraphDB and Ontotext S4 Cloud and can be explored and analyzed more easily.
2. • The specifics of clinical data
• What is RDF and how can we use it together with text analytics (TA)?
• Semantic annotations and their limitations
• What is semantic data normalization?
• Current state and next steps
Outline
September 8th, 2016
3. • Unstructured (Semi-Structured)
• Abundant
• Redundant
• Ambiguous
• Aggregated
Clinical Data
To transform your clinical data into information, and even knowledge, you have to analyze it!
… but before that, you have to make it ready for analysis!
4. What is RDF
The RDF data model resolves all syntax-level ambiguities
It helps you express all data in a common data model
ID GRAA_HUMAN STANDARD; PRT; 262 AA.
AC P12544;
DT 01-OCT-1989 (Rel. 12, Created)
DT 01-OCT-1989 (Rel. 12, Last sequence update)
DT 15-JUN-2002 (Rel. 41, Last annotation update)
DE Granzyme A precursor (EC 3.4.21.78) (Cytotoxic T-lymphocyte proteinase
DE 1) (Hanukkah factor) (H factor) (HF) (Granzyme 1) (CTL tryptase)
DE (Fragmentin 1).
GN GZMA OR CTLA3 OR HFSP.
OS Homo sapiens (Human).
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="In-Process">
<PMID Version="1">21500419</PMID>
<DateCreated> <Year>2011</Year> <Month>04</Month> <Day>15</Day> </DateCreated>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Electronic">1520-6882</ISSN>
<JournalIssue CitedMedium="Internet">
<Volume>82</Volume> <Issue>20</Issue>
<PubDate> <Year>2010</Year> <Month>Oct</Month> <Day>15</Day> </PubDate>
</JournalIssue>
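As a rough illustration of expressing a record in a common data model, the sketch below turns a few lines of the flat-file record above into subject-predicate-object triples. The `ex:` predicate names and the mapping from line codes to predicates are assumptions made for illustration, not an official vocabulary.

```python
# Hypothetical mapping from flat-file line codes to predicate names (assumed).
PREDICATES = {"ID": "ex:entryName", "AC": "ex:accession", "OS": "ex:organism"}

RECORD = """\
ID GRAA_HUMAN STANDARD; PRT; 262 AA.
AC P12544;
OS Homo sapiens (Human).
"""


def clean(code, value):
    """Normalize the raw value that follows a line code."""
    if code == "ID":
        return value.split()[0]  # the entry name is the first token
    return value.strip().rstrip(".;")


def record_to_triples(text):
    """Turn 'CODE value' lines into (subject, predicate, object) triples."""
    fields = {}
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        if code in PREDICATES:
            fields[code] = clean(code, value)
    # Use the accession number as the subject identifier (an assumption).
    subject = "ex:" + fields["AC"]
    return [(subject, PREDICATES[code], value) for code, value in fields.items()]


triples = record_to_triples(RECORD)
for triple in triples:
    print(triple)
```

Once both the flat-file record and the XML citation are reduced to triples like these, they can be stored and queried side by side regardless of their original syntax.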
5. Linked Data
How well interlinked is the linked data cloud?
• Many interesting queries are difficult to express in SPARQL
• String functions cannot be indexed
• Identifiers are often misplaced
[Figure: identifiers shown on the slide: P29965, UNIPROT, CD40L_HUMAN, cpath:CPATH-94138, cpath:CPATH-LOCAL-8467065, cpath:CPATH-LOCAL-8749236, uniprot:P29965, CD40L_HUMAN, TNF5_HUMAN, CD4L_HUMAN]
7. • Good for:
– Generation of machine readable meta data
– Semantic indexing of large sets of documents
– Providing additional background knowledge
• Limitations:
– Incomplete knowledge extraction
– Does not completely capture the context
Semantic Annotations
8. • What is it?
– A text analytics approach that aims to capture the full context of the information and to provide clear references to concepts/objects so that machines can interpret it easily.
• How do we do it?
– Work at the sentence level
– Extract the key phrases from the sentence
– Identify the main concept
– Identify all the qualifiers and negations
– Model the extracted data as RDF
Semantic Data Normalization
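The steps above can be sketched as a toy pipeline that works on one sentence at a time. The qualifier list, the negation cues, and the rule that whatever remains after removing qualifiers is the main concept are simplifying assumptions; a real system would match phrases against ontologies rather than a hard-coded lexicon.

```python
import re

# Toy lexicons (assumptions for illustration only).
QUALIFIERS = ["advanced", "biliary tract", "recurrent", "invasive"]
NEGATION = re.compile(r"\b(?:must not have|without|no)\b")


def annotate(sentence):
    """Return one annotation dict for a single sentence."""
    text = sentence.strip().rstrip(".").lower()

    # Negation: if a cue is found, keep only the phrase that follows it.
    negated = False
    cue = NEGATION.search(text)
    if cue:
        negated = True
        text = text[cue.end():]

    # Qualifiers: collect known qualifier phrases and remove them from the text.
    qualifiers = []
    for qualifier in QUALIFIERS:
        pattern = r"\b" + re.escape(qualifier) + r"\b"
        if re.search(pattern, text):
            qualifiers.append(qualifier)
            text = re.sub(pattern, " ", text)

    # Main concept: whatever is left once qualifiers are removed.
    main_concept = " ".join(text.split())
    return {"main_concept": main_concept, "qualifiers": qualifiers, "negated": negated}


condition = annotate("Advanced Biliary Tract Adenocarcinoma")
criterion = annotate("Patients must not have recurrent invasive breast cancer.")
print(condition)
print(criterion)
```

The resulting dictionaries are the intermediate form; the final step, modeling them as RDF, amounts to writing each field out as a triple attached to an annotation node.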
9. Semantic Data Normalization
• Condition text:
– “Advanced Biliary Tract Adenocarcinoma” (Study ID = NCT01506973)
• Text Analysis
– One phrase is identified in the Condition text
– Advanced Biliary Tract Adenocarcinoma
• Data Schema
– One annotation object is created
– Main concept is “Adenocarcinoma”
– Qualifier concepts are “Advanced” and “Biliary tract”
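One way the annotation object for this condition could be written down as triples is sketched below. The `ex:` predicates and the annotation identifier are hypothetical; the slides do not give the actual vocabulary used for condition annotations.

```python
STUDY = "NCT01506973"
ANNOTATION = STUDY + "_cond_1"  # hypothetical annotation identifier

# One annotation object: a main concept plus two qualifier concepts.
triples = [
    (STUDY, "rdf:type", "ClinicalTrial"),
    (STUDY, "ex:hasConditionAnnotation", ANNOTATION),
    (ANNOTATION, "ex:hasText", "Advanced Biliary Tract Adenocarcinoma"),
    (ANNOTATION, "ex:hasMainConcept", "Adenocarcinoma"),
    (ANNOTATION, "ex:hasQualifier", "Advanced"),
    (ANNOTATION, "ex:hasQualifier", "Biliary tract"),
]


def qualifiers_of(annotation, graph):
    """Collect the qualifier concepts attached to one annotation node."""
    return sorted(o for s, p, o in graph if s == annotation and p == "ex:hasQualifier")


print(qualifiers_of(ANNOTATION, triples))
```

Keeping the main concept and the qualifiers as separate statements is what allows a later query to find all adenocarcinoma studies regardless of stage or body location.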
11. • Study Conditions
– Multiple phrases in a text
– Pre-coordinated concepts vs. post-coordinated
– Scoring of matching concepts
• Study Interventions
– Drug, route, form
– Drug dosage
• Adverse Events
– Normalization of AE
– Post-coordinated concepts
• Eligibility Criteria
– Semantic sectioning and categorization
– Negations
– Diseases, findings, treatments, age and gender
Demo Example
12. Intervention Annotation Model - Drugs
[Diagram: RDF graph of a drug intervention annotation for study NCT01506973.
Nodes: NCT01506973 (rdf:type ClinicalTrial), NCT01506973_1_2, DrugAnnotationID, DrugDosageID, SingleDoseID, PeriodID, FrequencyID, 111418, SCTID:111418, SCTID:121681.
Edges: ct:hasIntervention, in:drugAnnotation, da:hasDrug, da:hasAdministrationRoute, da:hasAdministrationForm, da:hasDosage, do:hasSingleDose, do:hasPeriod, do:hasFrequency.
Value fields: Value, Unit, Denominator Value, Denominator Unit.]
13. Criteria Annotation Model
[Diagram: RDF graph of the eligibility criteria annotations for study NCT01506973.
Nodes: NCT01506973 (rdf:type ClinicalTrial), CriteriaSection (rdf:type “Inclusion”/“Exclusion”/“Not defined”), Criterion, AnnotationId (rdf:type “Disease”/“Drug”/…).
Edges: ct:hasCriteriaSection, cs:hasText, cs:hasCriterion, cr:hasText, cr:hasAnnotation, sa:Negation (“True”/“False”/…), Property 1, Property 2, Property N.
Example section text: “… No extensive intraductal components on core biopsy, defined as intraductal carcinoma. Patients must not have recurrent invasive breast cancer. …”
Example criterion text: “Patients must not have recurrent invasive breast cancer.”]
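A minimal sketch of how the criteria model might be populated, using the predicate names that appear in the model. The node identifiers, the section type chosen for the example, and the direction of each edge are assumptions, since the diagram layout is not recoverable from the slide text.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    ann_id: str
    ann_type: str  # e.g. "Disease" or "Drug"
    negated: bool


@dataclass
class Criterion:
    crit_id: str
    text: str
    annotations: list = field(default_factory=list)


@dataclass
class CriteriaSection:
    section_id: str
    section_type: str  # "Inclusion", "Exclusion" or "Not defined"
    criteria: list = field(default_factory=list)


def to_triples(trial_id, section):
    """Flatten one criteria section into (subject, predicate, object) triples."""
    out = [
        (trial_id, "rdf:type", "ClinicalTrial"),
        (trial_id, "ct:hasCriteriaSection", section.section_id),
        (section.section_id, "rdf:type", section.section_type),
    ]
    for criterion in section.criteria:
        out.append((section.section_id, "cs:hasCriterion", criterion.crit_id))
        out.append((criterion.crit_id, "cr:hasText", criterion.text))
        for ann in criterion.annotations:
            out.append((criterion.crit_id, "cr:hasAnnotation", ann.ann_id))
            out.append((ann.ann_id, "rdf:type", ann.ann_type))
            out.append((ann.ann_id, "sa:Negation", "True" if ann.negated else "False"))
    return out


# Hypothetical identifiers; the section type here is assumed, not taken from the study.
annotation = Annotation("ann1", "Disease", negated=True)
criterion = Criterion(
    "crit1", "Patients must not have recurrent invasive breast cancer.", [annotation]
)
section = CriteriaSection("sec1", "Inclusion", [criterion])
triples = to_triples("NCT01506973", section)
for triple in triples:
    print(triple)
```

Storing the negation as its own statement keeps "must not have breast cancer" distinguishable from "has breast cancer" even though both mention the same disease concept.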
14. • Work with ClinicalTrials.gov data as a public showcase
– > 215K clinical studies
– > 76 million RDF statements
• Coverage
– Conditions (197,154 objects)
– Diseases, Findings, Body locations, Qualifiers
– Interventions (rdf:type = ‘Drug’ and rdf:type = ‘Biologics’) – (381,590 objects)
– Drugs, Dosages, Administration form, Administration route, Population group
– Adverse Events – (1,226,754 objects)
– Diseases, Findings, Body locations, Qualifiers
– Criteria (semantic sectioning and categorization, negations) – (7,216,361 objects)
– Diseases, Findings, Drugs, Population groups
• In total, more than 80 million RDF triples
Current Status
15. • Directly mine the public enhanced CT.gov version
• Apply the same approach over your internal clinical trials data
• Once the data is semantically normalized, you can “slice and dice” it as your use case requires
• Examples
– Top-down data exploration
– Linked data browsing
How Can I Use This?
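As a rough sketch of what "slicing" normalized data can look like, the function below matches triple patterns with wildcards, similar in spirit to a single SPARQL triple pattern. The sample graph is invented for illustration: only NCT01506973 and its condition come from the slides, the second study ID is a placeholder, the `ex:` predicates are hypothetical, and concepts are attached directly to studies to keep the example short.

```python
def match(graph, s=None, p=None, o=None):
    """Return the triples that match the pattern; None acts as a wildcard."""
    return [
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


graph = [
    ("NCT01506973", "ex:hasMainConcept", "Adenocarcinoma"),
    ("NCT01506973", "ex:hasQualifier", "Advanced"),
    ("NCT_EXAMPLE_2", "ex:hasMainConcept", "Adenocarcinoma"),  # placeholder study
    ("NCT_EXAMPLE_2", "ex:hasQualifier", "Recurrent"),
]

# Top-down exploration: start from a concept, then narrow by qualifier.
studies = {s for s, _, _ in match(graph, p="ex:hasMainConcept", o="Adenocarcinoma")}
advanced = {s for s in studies if match(graph, s=s, p="ex:hasQualifier", o="Advanced")}
print(sorted(studies))
print(sorted(advanced))
```

Against a real triple store the same narrowing would be written as a SPARQL query; the in-memory version only shows why separating main concepts from qualifiers makes this kind of drill-down straightforward.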
16. Next Steps
• Release RDFized version of ClinicalTrials.gov
• Pre-loaded in GraphDB Free
• Pre-loaded on Ontotext S4 Cloud
• As RDF serialization distribution
• Release all semantically structured information under a license that is free for non-commercial use
• Extend the data schema to support not only concepts but also tokens that cannot be normalized to ontology instances