Discussion of provenance usage in the Learning Health System paradigm, as implemented in the TRANSFoRm project, with focus on security requirements and how they can be addressed using provenance graph abstraction.
2. Overview
• Learning Health System
• LHS requirements for provenance data
• TRANSFoRm project
• Transformation-oriented Access Control Language
for Provenance (TACLP)
3. Learning Health System
“ ... one in which progress in science,
informatics, and care culture align to generate
new knowledge as an ongoing, natural by-
product of the care experience, and
seamlessly refine and deliver best practices for
continuous improvement in health and health
care.” (Institute of Medicine)
We can’t afford to waste data!
4. A Learning Health System for the Nation
[Diagram: stakeholders in a nationwide LHS (pharmaceutical firms, Beacon
communities, integrated delivery systems, community practices, health
information organizations, health center networks, federal agencies,
state public health) linked through governance, patient engagement,
trust, analysis, and dissemination]
Defining functions of a LHS are to:
1. routinely and securely aggregate data from disparate sources
2. convert the data to knowledge
3. disseminate that knowledge, in actionable forms, to everyone who can
benefit from it.
c/o C. Friedman
5. Learning Health System take-up
• US medical/academic centres
o Mayo, Duke, Vanderbilt
o PCORI
• National data aggregators
o Clinical Practice Research Datalink
o NIVEL
• EHR vendors
o CSC, Asseco, TPP, InPractice Systems
• European academic-industrial
collaborations
o TRANSFoRm, EHR4CR, Semantic
HealthNet
…and Bill
6. Example: Clinical trial challenges
• Major motivation for the LHS work
• Trials too expensive and difficult to run
• Efficacy-effectiveness gap (EEG)
o Disconnect between outcomes from clinical trials and
information needed for clinical practice
o Interaction of drug effect and real-life contextual factors
o Challenge to identify contextual factors
• LHS provides context and workflow
7. LHS for Clinical Trials
• EHR integration
o Eligibility checking done automatically from EHR data
o eCRFs partially filled based on EHR information
o All collected data stored in the EHR system as well as the
research database
• Closing the loop
o eCRF data enriches the EHR
o Helps the clinician
o Adds value to the EHR system
• Data does not go to waste!
8. Trust in the LHS
• Research community is struggling to ensure transparency
and correctness of published research
• Reasons are complex and intertwined (positive bias, intractable
analysis, deluge of journals)
• Bayer Healthcare team published work showing that only
25% of the academic studies they examined could be
replicated
o Prinz et al. Nat. Rev. Drug Discov. 10, 712, 2011
• Of 53 oncology studies from 2001-2011, each highlighting
big new apparent advances in the field, only 11% (6) could
be robustly replicated.
o Begley & Ellis Nature 483, 531–533, 2012
9. Trust in the LHS (cont.)
• The problem is by no means restricted to preclinical studies
• Twelve randomised clinical trials tested 52 observational claims and
failed to reproduce a single one
o Young SS, Karr A. Deming, data and observational studies. Significance Sep
2011; 8(3):116–120
• Replication of 100 experiments published in 2008 in three high-
ranking psychology journals – fewer than half of the findings replicated
o Estimating the reproducibility of psychological science. Science Aug
2015;349(6251)
• Random sample of 441 biomedical journal articles 2000 – 2014: none
made all their data available, one provided full protocol, majority did
not disclose funding or conflicts of interest
o Iqbal et al. Reproducible Research Practices and Transparency across the
Biomedical Literature. PLoS biology 2016; 14(1)
• Cost of irreproducible research in life science is estimated at $28
billion per year in the U.S.
o Freedman LP, Cockburn IM, Simcoe TS. The Economics of Reproducibility in
Preclinical Research. PLOS Biology jun 2015; 13(6)
10. Data in the Learning Health System
• Each component in the healthcare system produces and
consumes data:
o Epidemiological research using record linkages
o Research data embedded in the EHR
o Decision support for diagnosis
• Provenance infrastructure required to support all these
domains
[Diagram: three kinds of data]
• Specific research data: clinical trials; controlled populations;
well-defined questions
• Routinely collected data: EHR systems; wide coverage; vast
quantity; may lack in detail and quality
• Actionable data: distilled scientific findings; usable in clinical
practice; decision support
11. TRANSFoRm project
• €7.5M European Commission 2010-2015
• Funded under the Patient Safety Work Program of FP7
• Developing methods, models, services, validated
architectures and demonstrations to support:
o Epidemiological research using GP records, including genotype-
phenotype studies and other record linkages
o Clinical trials embedded in the EHR
o Decision support for diagnosis
www.transformproject.eu
12. TRANSFoRm software landscape
[Diagram: front-end tools (RCT tools for electronic data collection,
epidemiological study tools for data queries, diagnostic support tools)
built on shared middleware: secure data transport, authentication
framework, data source connectivity module, provenance framework,
vocabulary service]
13. Use case 1: Type 2 Diabetes
• Research Question: In type 2 diabetic
patients, are selected single nucleotide
polymorphisms (SNPs) associated with
variations in drug response to oral antidiabetic
drugs (Sulfonylurea)?
• Design: Case-control study
• Data: primary care databases (phenotype
data) pre-linked to genomic databases
(genetic risk factors) – data federation
14. Use case 2: Gastro-oesophageal reflux disease (GORD)
• Research Question: What gives the best symptom relief
and improvement in Quality of Life: continuous or on
demand Proton Pump Inhibitor use?
• Design: Randomised Controlled Trial (RCT)
• Data: Collection through EHR & web based questionnaire –
electronic case report forms AND mobile Patient Related
Outcome Measures
• Provenance and security
15. Use case 3: Diagnostic Decision Support
• Early diagnostic suggestions for presenting problems:
• chest pain
• abdominal pain
• shortness of breath
• Clinical Prediction Rule web service (with underlying
ontology)
• Prototype Decision Support System integrated with a
commercial electronic health record system
• Vision by InPractice Systems
16. Provenance challenge for TRANSFoRm
• Viable methods for adoption in a heterogeneous
software environment
o No shared workflow middleware to rely on
• Need to achieve domain specificity
• Able to demonstrate conformance to standards
o Title 21 of the Code of Federal Regulations; Electronic
Records; Electronic Signatures (21 CFR Part 11)
o Good Clinical Practice (GCP)
o EudraLex Vol. 4 Annex 11: Computerised Systems in EU
o CONSORT, STROBE, RECORD
17. Semantic annotations
• Semantic concepts in the provenance graph defined using
TRANSFoRm ontologies:
o Clinical Research Information Model (CRIM)
o Software infrastructure ontology
o Clinical evidence ontology
• Ontology concepts used as annotations on provenance nodes
• Provenance templates define domain actions that map to
provenance fragments
[Diagram: PCROM (UML model) mapped to the Randomised Clinical Trial
Ontology (RCTO), which maps to the Randomised Clinical Trial
Provenance Ontology (RCTPO)]
18. Provenance templates
[Diagram: existing tools make API service calls to the provenance
server, which stores OPM graphs annotated with OWL in the provenance
database]
1. Tools are agnostic to provenance representation
2. Service invocation matches some provenance template in the
provenance server
3. Template is instantiated into a provenance graph fragment with
OWL concept annotations
4. Graphs merged inside the database
20. Provenance security
• Use a single provenance graph for:
o Full trial audit
o Reporting studies
o Publication review
o Collaborators
o Readers
• Need to abstract parts of the graph
• Access control and view generation for provenance
graphs
o Future Generation Computer Systems, Volume 49,
August 2015, Pages 8-27 Roxana Danger, Vasa Curcin,
Paolo Missier, Jeremy Bryans
21. Basic idea
• The aim of an access control strategy is not only to
determine whether the resource can be viewed, but
also to construct a view of the graph which satisfies
the security constraints
• The goal is to retain the maximum amount of
information
• NB Based on TRANSFoRm use cases but not
implemented in the live system
22. Access control
• Ensuring that a principal (person, process, etc.) can
only access the services or data in a system that
they are authorized to
• Implemented through security policies that try to
enforce a certain protection goal such as to prevent
unauthorized disclosure (secrecy) and intentional or
accidental unauthorized changes (integrity)
• Authorizations for some resource can be:
o Positive (allow)
o Negative (deny)
23. Access control
• Two classical approaches:
o Closed policy
• Deny-by-default
• Access to a resource is only granted if a corresponding positive
authorization policy exists
o Open policy
• Permit-by-default
• Access is granted unless a corresponding negative authorization
policy exists.
• Combined approach used to support policy exceptions
• Conflict resolution needed if multiple policies apply,
e.g.
o denials-take-precedence
o most-specific-takes-precedence
o priority levels
o time-dependent access.
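As a minimal sketch of these ideas (all names and the policy representation are hypothetical, not from the TRANSFoRm implementation), closed and open policies with denials taking precedence can be expressed as:

```python
# Sketch: resolving access under closed vs open default policies,
# with deny-takes-precedence when both kinds of policy match.

def is_allowed(resource, policies, default="closed"):
    """Return True if access to `resource` is granted.

    `policies` is a list of (resource, effect) pairs, effect being
    'permit' or 'deny'. Under the closed policy access is denied unless
    a positive authorization exists; under the open policy it is granted
    unless a negative one exists. Conflicts resolved by
    denials-take-precedence (one of the strategies listed above).
    """
    matching = [effect for res, effect in policies if res == resource]
    if "deny" in matching:          # denials take precedence
        return False
    if "permit" in matching:
        return True
    return default == "open"        # fall back to the default policy

policies = [("ehr", "permit"), ("trial", "deny")]
print(is_allowed("ehr", policies))      # True
print(is_allowed("trial", policies))    # False
print(is_allowed("lab", policies))      # False under the closed policy
```

Under `default="open"`, the last call would instead return True, illustrating why the combined approach needs explicit conflict resolution.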
24. Access control languages for provenance
• Qun Ni et al
o Semantic description of subjects (user roles) and resources
to be accessed
o conditions under which restrictions are applied,
o four different types of access permissions.
• Cadenhead et al
o Added regular expressions for resource and condition
descriptions
• Transformation-oriented Access Control Language for
Provenance (TACLP)
o Allows users to define subgraphs to be transformed, with
three different levels of abstractions (namely hide, minimal
and maximal).
26. External effects and causes
• External effects and causes of the set of nodes S
w.r.t. a set of nodes R
o Set of nodes that represent the immediate
effects/causes of S that would be affected by removal of
nodes in R from the graph V (𝑆 ⊆ 𝑅 ⊆ 𝑉)
o If S=R, then denote as ef(R) and ca(R)
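A hedged illustration of these definitions, under an assumed representation of the provenance graph as a map from each node to the set of its immediate causes (this representation and the function names are illustrative, not the paper's formalism):

```python
# External causes/effects of S w.r.t. R (S subset of R subset of V).

def external_causes(deps, S, R):
    """Immediate causes of nodes in S that lie outside R."""
    return {c for n in S for c in deps.get(n, set()) if c not in R}

def external_effects(deps, S, R):
    """Nodes outside R whose immediate causes include a node in S."""
    return {n for n, causes in deps.items()
            if n not in R and causes & S}

# Chain A <- B <- C <- D (each node maps to its immediate causes).
deps = {"B": {"A"}, "C": {"B"}, "D": {"C"}}
R = {"B", "C"}
# With S == R these are ca(R) and ef(R) from the slide.
print(external_causes(deps, R, R))   # {'A'}
print(external_effects(deps, R, R))  # {'D'}
```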
28. Basic operations
• Node removal
o Subgraph needs to be hidden
o e.g. if it is unnecessary for an analysis or user access to it
has been restricted.
• Node replacement
o removing details of data and operations in a subgraph
while retaining some information (abstract entity) of the
existence of such subgraph.
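The replacement operation can be sketched under the same assumed dependency-map representation (an illustrative simplification, not the paper's formal RepR definition): the nodes in R collapse into one abstract node that inherits R's external causes and is depended on by R's external effects.

```python
# Sketch of node replacement: collapse R into a single abstract node.

def replace_nodes(deps, R, label):
    """Return a new dependency map with R replaced by `label`."""
    ca = {c for n in R for c in deps.get(n, set()) if c not in R}
    new_deps = {n: {label if c in R else c for c in causes}
                for n, causes in deps.items() if n not in R}
    new_deps[label] = ca              # abstract node keeps R's causes
    return new_deps

deps = {"B": {"A"}, "C": {"B"}, "D": {"C"}}   # A <- B <- C <- D
print(replace_nodes(deps, {"B", "C"}, "abs1"))
# {'D': {'abs1'}, 'abs1': {'A'}}
```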
29. Operation: node removal
• Let Prov = (V, E, type) and R ⊆ V be a set of nodes to be
removed. The result is a new provenance graph Prov′ =
(V′, E′, type′); the formal definition is given in the paper.
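A minimal sketch of removal under the same assumed dependency-map representation (illustrative, not the formal definition): drop the nodes in R and add soft dependency edges from R's external effects to its external causes, so the causality of the rest of the graph is preserved.

```python
# Sketch of node removal with soft edges bridging the removed subgraph.

def remove_nodes(deps, R):
    """Return a new dependency map with the nodes in R removed."""
    ca = {c for n in R for c in deps.get(n, set()) if c not in R}
    ef = {n for n, causes in deps.items() if n not in R and causes & R}
    new_deps = {n: {c for c in causes if c not in R}
                for n, causes in deps.items() if n not in R}
    for n in ef:                      # soft edges: effect -> cause
        new_deps[n] |= ca
    return new_deps

deps = {"B": {"A"}, "C": {"B"}, "D": {"C"}}   # A <- B <- C <- D
print(remove_nodes(deps, {"B", "C"}))   # {'D': {'A'}}
```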
31. Abstract nodes and edges
• Dummy nodes introduced during entity
replacement
• Preserve the causality of the rest of the graph
• Two types of dependencies:
o Indirect
• Denoted with double lines
• Represent multi-step dependencies (wdf+, u+, wgb+, wtb+)
o Soft dependencies
• Denoted with double dashed lines
• Generic transitive relationship which is not one of the above
34. False dependencies
• False dependencies introduce a previously non-
existent path in the new graph, e.g. removing A, B
35. Causality preserving transformation
• A transformation is called causality preserving if it
does not introduce false dependencies.
• Given a provenance graph and a set of entities to be
abstracted/hidden, the question is how can these
entities be joined or removed from the graph using
only causality-preserving transformations?
36. Causality preserving partition and transformation
• Given a set of nodes R ⊆ V, a causality preserving
partition ℘ of R is such that removing or replacing
any set of nodes 𝑃 ∈ ℘ will not introduce causal
dependencies
• A graph transformation by partition ℘ of R is then a
sequential application of Rem_P or Rep_P
• The necessary and sufficient condition for such
transformation to be causality preserving is that for
each 𝑃 ∈ ℘ all of P’s external causes and effects are
connected
37. Optimal causality preserving partition
• Default partition of R consists of singletons, i.e.
each node in R is a set in the partition.
• Optimal partition is such that none of its sets have
the same sets of external causes and effects w.r.t. R
• Partitioning algorithm
o Step 1, determine external causes and effects for default
partition
o Step 2, gradually merge the partitions until optimal.
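The two steps above can be sketched as follows (an assumed one-pass simplification of the paper's algorithm, using the same illustrative dependency-map representation as earlier): start from singletons and merge cells that share the same external causes and effects with respect to R.

```python
# Sketch of the partitioning algorithm: merge singleton cells of R
# that have identical (external causes, external effects) signatures.

def optimal_partition(deps, R):
    def ext(P):
        ca = frozenset(c for n in P for c in deps.get(n, set())
                       if c not in R)
        ef = frozenset(n for n, cs in deps.items()
                       if n not in R and cs & P)
        return ca, ef

    cells = {}                        # signature -> merged cell
    for n in R:                       # step 1: signatures of singletons
        cells.setdefault(ext({n}), set()).add(n)   # step 2: merge
    return list(cells.values())

# B and C share cause A and effect D, so they merge into one cell.
deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
print(sorted(sorted(cell) for cell in optimal_partition(deps, {"B", "C"})))
# [['B', 'C']]
```

Merging cells that would be replaced by indistinguishable abstract nodes is what keeps the resulting partition minimal.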
38. Provenance graph transformation algorithm
• Once the partition is computed, the
transformations are iteratively applied to each
element in the partition
• Labels input provides names for generated abstract
nodes
• Levels input provides abstraction level for each
partition
o Hide
• remove operation
o Minimum abstraction, maximum abstraction
• replace operation
• isolated singletons removed as a special case.
39. Computational efficiency
• Transformation algorithm performance depends on
the performance of the partition algorithm
• The other steps are linear to cardinality of the set of
partitions ℘ and its edges
• The partition algorithm considers pair-wise
combinations of nodes.
• Overall complexity is O(|R|²), where R is the set of
nodes to abstract
40. Experimental results
• Provenance view transformation algorithm was
implemented in Python 2.7 using the NetworkX library.
• Experiments were executed on Ubuntu 12.04, Intel
Core i7-3687U CPU with 2.10GHz and 8GB RAM
• Synthetic provenance graphs used, randomly
generating edges for each node within the degree
range 2-10
• Two parameters:
o the percentage of nodes to abstract (from 5 to 25 in steps of 5)
o the percentage of nodes to abstract which are causally
dependent (from 0 to 100 in steps of 25)
• Each configuration was executed 10 times and the plots
presented show the averages of these executions.
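The synthetic-graph setup could be sketched like this (the interpretation of the degree range as the number of immediate causes per node is an assumption; the slide does not specify the generator):

```python
import random

# Sketch: random acyclic provenance-like graph where each node depends
# on between 2 and 10 earlier nodes (assumed reading of "degree 2-10").

def random_provenance_dag(n, dmin=2, dmax=10, seed=0):
    rng = random.Random(seed)
    deps = {0: set()}
    for i in range(1, n):
        k = min(i, rng.randint(dmin, dmax))      # cap by available nodes
        deps[i] = set(rng.sample(range(i), k))   # only earlier nodes:
    return deps                                  # acyclic by construction

g = random_provenance_dag(50)
print(len(g))   # 50
```

Averaging repeated runs over such graphs, as the slide describes, smooths out the variance introduced by the random edge choices.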
41. Performance behaviour
• Execution time (Y) in seconds as a function of the
number of nodes (X) and the percentage of nodes
to abstract (Z)
• Quadratic time
42. Use case: Access to health data
• Access control for the provenance data collected from an
Electronic Health Record (EHR) and clinical trial systems
• Rules:
o Auditors. Healthcare system auditors or law enforcement agencies can
access the whole provenance graph during the auditing process.
o Family doctors and patients. Electronic health records and their data
provenance can only be accessed by patients during weekends, and by
FDs during weekdays.
o Active FDs. Active FDs have access to the provenance data associated
with the EHRs of their patients and its provenance;
o Clinical trial 1. If some data comes from a clinical trial, the GP needs to
be a participant in the trial to see the subgraph associated with that trial.
o Clinical trial 2. Patients do not have access to clinical trial processes.
o Laboratory. Patients do not have access to laboratory processes.
o Automatic diagnosis recommendation. Patients have no access to any
information related to the automatic diagnosis recommendation nor to
the graph segment connecting it with the clinical evidence.
43. TACLP
• Transformation-oriented Access Control Language
for Provenance (TACLP)
• Extends the work of Ni and Cadenhead by
introducing transformations
• A policy consists of:
o Target
o Effect
o Transformation
o Condition (optional)
o Obligation (optional)
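The policy structure above could be modelled as follows (field names and values are assumptions read off the slides, not the actual TACLP schema):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative-only model of a TACLP policy and its target element.

@dataclass
class Target:
    subjects: set                      # IRI references to users
    records: set                       # IRI references to resources
    restriction: Optional[str] = None  # conditional expression
    scope: Optional[str] = None        # e.g. 'transferable'

@dataclass
class Policy:
    target: Target
    effect: str          # 'absolute-permit' | 'deny' | 'necessary-permit' | 'permit'
    transformation: str  # 'hide' | 'min-abstraction' | 'max-abstraction'
    condition: Optional[str] = None
    obligation: Optional[str] = None

# E.g. the "Laboratory" rule above: patients may not see lab processes
# (the IRI-like identifiers here are made up for the example).
lab_rule = Policy(
    target=Target(subjects={"role:patient"}, records={"class:LabProcess"}),
    effect="deny",
    transformation="hide",
)
print(lab_rule.effect, lab_rule.transformation)   # deny hide
```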
44. TACLP Target
• Subject element
o Set of users (subject element) to which the policy should be
applied, expressed through IRI references
• Record element
o Set of resources to which the policy should be applied,
expressed through IRI references
• Restriction element (optional)
o A conditional expression under which the policy is applied
o Either a relational comparison between a value in a property
path and a literal, or a full logical expression.
• Scope element (optional)
o Whether the policy is ‘transferable’ or ‘non-transferable’ with
respect to subjects
o Whether it applies to all the ancestors of matched elements
in the graph, or to the matched elements only.
45. TACLP Effect
• Specifies the intended outcome
• Four possibilities:
o Absolute permit guarantees access to the graph regardless
of the effect of other policies
• e.g. for allowing access to auditors or law enforcement agencies, and
avoids the need for additional conditions in deny policies
o Deny guarantees that certain parts of the graph will not be
accessed by users in the subject element.
o Necessary permit is used to describe the necessary, but not
always sufficient, conditions for accessing certain parts of the
graphs
o Permit is used to describe those parts of the graph that can
be accessed if there are no other policies denying access to
it.
46. TACLP Transformation
• How to transform the provenance graph in order to
hide certain resources
• Specification of which nodes need to be hidden and
Removal/Replace operations to be applied to them
• Set of policies comprising
o Policy type (target, record, condition, effect,
transformation element and obligation)
o Policy evaluation type (deny- takes-precedence or
permit-takes-precedence)
47. TACLP Transformation
• Abstraction level
o Hide
• matched nodes of the subgraph have to be completely hidden
(removed) from the graph
• Remove transformation is applied;
o Minimum abstraction
• Replace transformation is applied
• No caused-by relationship (soft dependencies) will appear in
the transformed graph.
o Maximum abstraction
• Replace transformation is applied
• Soft dependencies can appear in the transformed graph.
48. Access control evaluation algorithms
• Aim to produce an abstracted graph that satisfies
the constraints
• Deny-takes-precedence
1. Absolute permit policies evaluated first
2. Necessary permit and deny policies
3. Permit policies
• Permit-takes-precedence
1. Absolute permit evaluated first
2. Necessary permit policies
3. Permit policies
4. Deny policies
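The two orderings above can be sketched as phase lists (a deliberate simplification: necessary-permit semantics and graph transformations are omitted, and the policy representation is invented for the example):

```python
# Sketch: phased policy evaluation; the first matching phase decides.

DENY_ORDER = ["absolute-permit", "necessary-permit", "deny", "permit"]
ALLOW_ORDER = ["absolute-permit", "necessary-permit", "permit", "deny"]

def evaluate(policies, resource, order=DENY_ORDER):
    """Return 'permit' or 'deny' for `resource` under the given order."""
    for phase in order:
        for effect, matches in policies:
            if effect == phase and resource in matches:
                return "deny" if effect == "deny" else "permit"
    return "deny"                     # closed policy as the default

policies = [("deny", {"trial"}), ("permit", {"trial", "ehr"})]
print(evaluate(policies, "trial"))   # deny (deny phase comes first)
print(evaluate(policies, "ehr"))     # permit
```

With `ALLOW_ORDER`, the same "trial" request would be permitted, which is exactly the difference between the two evaluation algorithms.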
51. Summary
• Learning Health System presenting new set of
challenges for medical and informatics communities
• Provenance can help establish trust in the LHS
• Methods needed to verify trust
• Abstraction of provenance traces needed to address
requirements of multiple stakeholders
o Researchers
o Regulators
o Publishers
• Future work
o Projects running on provenance of decision support and
visual analytics for health data
o Looking for partnerships to investigate applications of the
security work
The US health system is going digital: ~30% now, ~80% by 2019
- In many EU countries, primary care has 100% usage of EHRs, and more than 50% of practices are completely paperless
• If each care provider, patient, and researcher used their own data only for immediate needs, we would fail to realize its potential
• If comparable data are shared, we can learn and improve
• The key is to figure out how to do this routinely.
We can’t afford to waste data.
The overall goal is a healthcare system that draws on the best evidence to provide the most appropriate care for each patient, focusing on prevention and health promotion, delivers the most value, and adds to learning and improvements with each care experience
LHS:
“ ... one in which progress in science, informatics, and care culture align to generate new knowledge as an ongoing, natural by-product of the care experience, and seamlessly refine and deliver best practices for continuous improvement in health and health care.” (Institute of Medicine)
Examples:
1. Nationwide post-market surveillance of a new drug quickly reveals that personalized dosage algorithms require modification. A modified decision support rule is created and is implemented in EHR systems.
2. During an epidemic, new cases reported directly from EHRs. As the disease spreads into new areas, clinicians are alerted.
3. A patient faces a difficult medical decision. She bases that decision on the experiences of other patients like her.
Key is to move beyond individual knowledge silos – there are some wonderful solutions out there, particularly in the US, which do brilliant work locally, but do not consider the interoperability with the wider world. Researchers are increasingly asking: how portable is this, and how can we pick it up?
Feedback loop
Part of a wider reproducibility challenge
Potential reasons:
incorrect or inappropriate statistical analysis of results or insufficient sample sizes
pressure to publish sometimes results in negligence over the control or reporting of experimental conditions
bias towards publishing positive results
many initially rejected papers get published in other journals without substantial changes or improvements
Important not to overreact: Being unable to reproduce the findings does not automatically mean that the study is flawed, however it does open the research to questions.
Number of citations for the unreproducible findings actually outpaced those with reproducible findings! (ibid)
Front end tools sharing the same set of reusable components in middleware and data connectivity package
We start from domain ontologies, and map them to provenance ontologies, using OPM concepts
Our project predates the current W3C standard, but mapping from OPM to PROV is straightforward (the other way, not necessarily so)
Challenge was that our tools are heterogeneous, some are user-facing, and don’t share an execution environment.
Thus we introduced provenance templates.
Provenance server exports service interface based on the templates
Abstract provenance graph fragments with semantic annotations
Client applications provide details (investigators, data set references, study parameters)
Sent to the provenance interface and converted into full provenance graphs and stored in the database
This is a very non-intrusive way of embedding provenance into a software ecosystem.
Our work builds upon existing work in the field.
Key are generic causality relations and indirect relations
Indirect essentially composes the original relation with transitive closure of was derived from
R is the set that is being removed; we observe the effects and causes of its subset S
Highlight the difference when R=S
Entity removal transformation (RemR) is used when a subgraph needs to be hidden, e.g. if it is unnecessary for an analysis or user access to it has been restricted.
Entity replacement (RepR) is used for removing details of data and operations in a subgraph while retaining some information (abstract entity) of the existence of such subgraph.
Removal and replacement transformations do not introduce cycles in the new graph as long as the original graph is acyclic, as OPM provenance graphs are. However, using these transformations on an arbitrary set of nodes can introduce false dependencies, that is, causal links that were not present in the original graph.
Soft edge introduced in remove
Removal and replacement transformations do not introduce cycles in the new graph
However, using these transformations on an arbitrary set of nodes can introduce false dependencies, that is, causal links that were not present in the original graph.
entity replacement transformation introduces false dependencies when entities A and B are joined.
In this case, paths from 2 to 5, D, and E do not exist in the original graph.
Proof in the paper
Not in the live system
Use of c* - the most general connectivity in the provenance graph
The graph shows the evolution of an EHR of a patient during two visits and the subsequent actions. In the first visit the patient (Ag1) visited a general practitioner (Ag3) and an EHR system (Ag2) was used to record all the details of the visit. First, a new item creation process (P1) was executed, generating a new EHR version (EHR v20 - A2) based on the previous version of the patient’s EHR (EHR v19 - A1). After the patient detailed the symptoms, the GP gave them a prescription (A3) to be followed, created a blood test form (A4) for the test to be performed, and updated the data in the EHR system, generating a new version of the record (EHR v21 - A5). The blood test form was used to prepare the instrumentation and conduct the measurement (P3). All these operations were controlled by the laboratory system (Ag4) and a laboratory technician (Ag5). As part of this process, a laboratory condition report was generated (A6), and it triggered the blood test report creation process (P5), which generated the test report (A7) and a new version of the EHR containing the results of the test (EHR v22 - A9). The test and the laboratory condition reports (A7 and A6) were both used during the creation of an electronic Case Report Form, eCRF, (P4), as the patient is involved in a clinical trial, and their progress is also followed by the clinical trial researcher (Ag6). The result of this action is the eCRF (A8).
In the second visit, a new EHR item process (P6) was executed again, producing the new version of the EHR (A10). Following this, the doctor used a decision support system (Ag10) to confirm their diagnostic hypothesis. They opened the application, entered the patient details (P7), and a set of diagnostic cues (A11) that were extracted from the EHR of the patient. These were then compared (P8) with the clinical evidence repository (A12) of the decision support system. A diagnostic recommendation (A13) was then obtained and given as a possible option to the GP, who used it to generate their final diagnosis (A15). A variable containing the recommendation chosen by the GP (A14) is also generated and maintained by the decision support tool. Once the GP had the diagnosis, they proceeded to update the data in the EHR system, generating a new prescription for the patient (A16), and a new version of the EHR (A17).
Notice that the labels properly describe the aim of the abstracted entities in the cases of laboratory and clinical trial, and the whole subgraph corresponding to the automatic diagnosis decision support processing is removed.
Ultimately, LHS is about scaling up of the health system, and consequently the associated research that health system is built upon. If this scaling is to succeed we have to install mechanisms to verify trust in the system inside our research instruments. In the research world increasingly reliant on electronic tools, provenance gives us a lingua franca to achieve traceability, which we have shown to be essential to building these mechanisms. The idea was evaluated in a provenance infrastructure that was implemented in the TRANSFoRm project in three distinct LHS domains, those of clinical trials, decision support systems and cohort studies. The challenge now is to address the provenance gap that exists between the provenance metadata collected and the reporting requirements of different domains, and this will require a joint effort by a range of stakeholders, including medical scientists, informaticians, publishers and regulators. However, this work is essential if the quality of translation from research into practice in the LHS is to improve with the growing volume of data and research and not deteriorate.