1. Using CWL to support EHR-based phenotyping
Martin Chapman
King’s College London
3. EHR-based phenotyping i
Data mining rebranded for use with electronic health records (EHRs).
Simplest example of EHR-based phenotyping process: flag patients with a certain clinical code
as having a disease (e.g. COVID-19).
PatientID  Clinical Code  Has COVID-19?
001        C19            ✓
002        COPD           -
003        C19            ✓
004        -              -
Figure 1: EHRs at a given clinic
7. EHR-based phenotyping ii
A slightly more complex phenotyping process might look at multiple criteria. It might
consider a patient to have a given condition if any of these criteria are true.
For example, if a patient has a code from one of a number of different coding schemes:
PatientID  Clinical Code  ICD-10 code?  SNOMED code?  Has COVID-19?
001        ICD-C19        ✓             -             ✓
002        COPD           -             -             -
003        SNOMED-C19     -             ✓             ✓
004        -              -             -             -
Figure 2: EHRs at a given clinic
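The "any criterion" rule in Figure 2 can be sketched in a few lines of Python. This is only an illustration: the code sets and the record layout are assumptions based on the example table, not part of any real phenotype.

```python
# Hypothetical code lists; real lists would come from the phenotype definition.
ICD10_CODES = {"ICD-C19"}
SNOMED_CODES = {"SNOMED-C19"}

# The four example records from Figure 2; None stands for "no code recorded".
records = [
    {"PatientID": "001", "Clinical Code": "ICD-C19"},
    {"PatientID": "002", "Clinical Code": "COPD"},
    {"PatientID": "003", "Clinical Code": "SNOMED-C19"},
    {"PatientID": "004", "Clinical Code": None},
]

def flag_cases(rows):
    """Add one flag per criterion, then mark a case if any criterion holds."""
    flagged = []
    for row in rows:
        code = row["Clinical Code"]
        icd10 = code in ICD10_CODES
        snomed = code in SNOMED_CODES
        flagged.append({**row, "ICD-10 code?": icd10, "SNOMED code?": snomed,
                        "Has COVID-19?": icd10 or snomed})
    return flagged

cases = [row["PatientID"] for row in flag_cases(records) if row["Has COVID-19?"]]
print(cases)
```

Keeping one flag per criterion, rather than a single combined test, mirrors the column-per-criterion layout of the table.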
8. Phenotype definitions
The EHR-based phenotyping process is captured as a phenotype definition (abstract and
non-executable logic), and in turn implemented for use in practice as a computable
phenotype (concrete and executable implementation).
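[Figure: the same phenotype as a phenotype definition (flowchart) and as a computable
phenotype (SQL).]
Phenotype definition (flowchart): EHR -> ICD-10 code? -> Yes -> CASE; SNOMED code? -> Yes -> CASE
Computable phenotype (SQL):
    SELECT UserID, Codes
    FROM Patients
    WHERE Codes IN ('ICD-C19', 'SNOMED-C19');
    ...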
9. Challenges and CWL solutions i
The wider phenotype definition and computable phenotype landscape is more complex; see, for
example, portal.caliberresearch.org and phekb.org.
12. Challenges and CWL solutions ii
• Phenotype definitions come in many different forms (flowcharts, text descriptions,
weights for a classifier, etc.) and lack standardisation. This reduces intelligibility and
thus phenotypic reproducibility (the ability to accurately implement the logic intended
by the definition author).
  Solution: a new model to structure definitions, based on CWL.
• Computable phenotypes often don’t exist at all. This affects phenotypic portability
(the effort associated with implementing a definition).
  Solution: an architecture, Phenoflow, to parse definitions under our new model and make
  them available to researchers to download in CWL.
16. Why a workflow?
All phenotype definitions can be considered as, or reduced to, a set of steps, which start
with a patient population, apply a number of criteria to that population, and, depending on
the relationship between those criteria, determine cases of the disease.
1. For simpler definitions, considered here, if any of the criteria are met, a patient is
considered a case, and this can be flagged by individual steps.
2. For more complex definitions, where multiple criteria must all be met, or meeting a
criterion (e.g. being under a certain age) actually should exclude an individual from
having a condition, this can be determined at the end of the workflow, before a final
cohort is produced. Here, the sequential nature of a workflow is important.
3. Nested workflows can be used to handle complex branches (more shortly).
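As a rough sketch of points 1 and 2, each criterion can be run as its own step, with case status decided only once every step has reported. The function and criterion names are hypothetical, and the age-based exclusion is just the example from point 2.

```python
def final_cohort(patients, inclusion, exclusion=(), require_all=False):
    """Run every criterion as a step, then decide case status at the end."""
    combine = all if require_all else any
    cohort = []
    for patient in patients:
        # Each step only records a flag; no decision is made until all have run.
        included = [criterion(patient) for criterion in inclusion]
        excluded = [criterion(patient) for criterion in exclusion]
        if combine(included) and not any(excluded):
            cohort.append(patient["id"])
    return cohort

def has_icd10(patient):
    return "ICD-C19" in patient["codes"]

def has_snomed(patient):
    return "SNOMED-C19" in patient["codes"]

def under_18(patient):
    # Hypothetical exclusion criterion.
    return patient["age"] < 18

patients = [
    {"id": "001", "codes": {"ICD-C19"}, "age": 54},
    {"id": "002", "codes": {"COPD"}, "age": 61},
    {"id": "003", "codes": {"SNOMED-C19"}, "age": 12},
]

print(final_cohort(patients, [has_icd10, has_snomed]))
print(final_cohort(patients, [has_icd10, has_snomed], exclusion=[under_18]))
```

Deferring the decision to the final step is what lets "all must be met" and exclusion rules share the step structure used by the simple "any" case.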
17. CWL-based model i
A new CWL-based model for the definition of a phenotype:
step (Abstract layer):           number, group, id, description, type
  Input (Functional layer):      id, description
  Output (Functional layer):     id, description, extensionA
Implementation units (Computational layer):
  implementationUnitA:           pathA, languageA, paramsA
  implementationUnitB:           pathB, languageB, paramsB
Figure 3: CWL-based definition model (step) and implementation units*.
*the bits of code actually executed by definitions structured under this model; separate from the model
itself.
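One way to picture the model is as a small set of record types. The sketch below is an assumption about how the fields in Figure 3 relate to one another, not Phenoflow's actual data model; the class names and the assignment of fields to layers are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Input:
    # Functional layer: metadata about what a step receives.
    id: str
    description: str

@dataclass
class Output:
    # Functional layer: metadata about what a step produces.
    id: str
    description: str
    extension: str

@dataclass
class ImplementationUnit:
    # The code actually executed; referenced by, but separate from, the model.
    path: str
    language: str
    params: Optional[str] = None

@dataclass
class Step:
    # Abstract layer: the human-readable logic of one step.
    number: int
    id: str
    description: str
    type: str
    input: Input
    output: Output
    group: Optional[str] = None
    implementation_units: List[ImplementationUnit] = field(default_factory=list)

step = Step(
    number=2,
    id="icd10",
    description="A case is identified by the stated icd10 COVID-19 codes.",
    type="logic",
    input=Input("covid19 cohort", "Potential covid19 cases."),
    output=Output("covid19 cases icd10", "covid19 cases, as identified by icd10 coding.", "csv"),
    implementation_units=[
        ImplementationUnit("icd10.py", "python"),
        ImplementationUnit("icd10.js", "javascript"),
    ],
)
print([unit.language for unit in step.implementation_units])
```

Holding the implementation units in a list on the step makes it explicit that one piece of logic can have several interchangeable implementations.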
20. CWL-based model ii
Model is separated into layers:
• Abstract Expresses the logic of a phenotype through a set of simple sequential,
potentially nested steps, each of which is annotated with multiple descriptions. Emphasis
on intelligibility.
• Functional Specifies the metadata of entities passed between the operations within the
abstract layer, e.g., the format of an intermediate cohort.
• Computational Defines an environment for the execution of one or more
implementation units (e.g. a script, data pipeline module, etc.) for each step in the
abstract layer. Inherently supports implementation by providing a template for
development.
21. CWL-based model iii
step (Abstract layer)
  number:       2
  group:        -
  id:           icd10
  description:  A case is identified in the presence of patients associated with the stated
                icd10 COVID-19 codes.
  type:         logic
Input (Functional layer)
  id:           covid19 cohort
  description:  Potential covid19 cases.
Output (Functional layer)
  id:           covid19 cases icd10
  description:  covid19 cases, as identified by icd10 coding.
  extension:    csv
Implementation units (Computational layer)
  icd10.py (language: python, params: -)
    for row in csv_reader:
        newRow = row.copy()
        for cell in row:
            if [value for value in row[cell].split(",") if value in codes]:
                newRow["covid19"] = "CASE"
    ...
  icd10.js (language: javascript, params: -)
    for (row of csvData) {
      newRow = row.slice();
      for (cell of row) {
        if (cell.split(",").filter(code => codes.indexOf(code) > -1).length) {
          newRow.push("CASE");
    ...
Figure 4: Individual step of COVID-19 phenotype definition and implementation units.
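For readers who want to try the Python unit, the following is a self-contained sketch built around the fragment in Figure 4. The surrounding reader and writer setup, the code list and the sample data are assumptions added for illustration; the real implementation unit may differ.

```python
import csv
import io

CODES = {"ICD-C19"}  # hypothetical ICD-10 code list

def flag_icd10_cases(file_in, file_out, codes=CODES):
    """Copy each row, adding a 'covid19' column set to CASE on a code match."""
    csv_reader = csv.DictReader(file_in)
    csv_writer = csv.DictWriter(file_out, fieldnames=csv_reader.fieldnames + ["covid19"])
    csv_writer.writeheader()
    for row in csv_reader:
        new_row = row.copy()
        new_row["covid19"] = ""
        for cell in row:
            # A cell may hold several comma-separated codes.
            if any(value in codes for value in row[cell].split(",")):
                new_row["covid19"] = "CASE"
        csv_writer.writerow(new_row)

# In-memory stand-ins for the input cohort and output files.
source = io.StringIO('PatientID,Codes\n001,"ICD-C19,COPD"\n002,COPD\n')
result = io.StringIO()
flag_icd10_cases(source, result)
print(result.getvalue())
```

Passing file objects rather than paths keeps the matching logic testable without touching disk; a command-line wrapper could open sys.argv[1] and hand it to the same function.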
22. Relationship to CWL i
‘Informal subset’ of CWL, specified using step type metadata:
• The first step must be of a connector type (currently load or external), designed to
extract data from a datasource without performing any processing on that data, and pass
it to the second step.
• Other steps in a definition must describe the logic of the phenotype (types are currently
boolean logic and generic logic, the latter supporting, for example, case exclusion).
• The final step must be of an output type, outputting a final condition cohort to disk,
taking into account any relationships between boolean steps (e.g. all must be true).
More: https://github.com/kclhi/phenoflow/wiki/Model
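The ordering rules above lend themselves to a simple check. The sketch below assumes the type labels 'load', 'external', 'boolean', 'logic' and 'output'; the first two appear in the slide text, while the exact spellings of the others are assumptions. It is not Phenoflow's own validation code.

```python
CONNECTOR_TYPES = {"load", "external"}
LOGIC_TYPES = {"boolean", "logic"}  # assumed labels for the two logic types
OUTPUT_TYPE = "output"              # assumed label for the output type

def validate_step_types(types):
    """Return a list of problems; an empty list means the ordering is valid."""
    if len(types) < 2:
        return ["a definition needs at least a connector step and an output step"]
    problems = []
    if types[0] not in CONNECTOR_TYPES:
        problems.append(f"first step must be a connector, got '{types[0]}'")
    for position, step_type in enumerate(types[1:-1], start=2):
        if step_type not in LOGIC_TYPES:
            problems.append(f"step {position} must describe logic, got '{step_type}'")
    if types[-1] != OUTPUT_TYPE:
        problems.append(f"final step must be an output, got '{types[-1]}'")
    return problems

print(validate_step_types(["load", "logic", "boolean", "output"]))
print(validate_step_types(["logic", "load", "output"]))
```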
24. Other model benefits
Beyond standardising definitions (and thus improving phenotypic reproducibility), a CWL-based
model provides us with a number of other benefits:
• As we’ve already seen, we can have different implementations for the same definition.
  - Often different sites will realise the same phenotype logic using different
    implementation units, and we want to map these to the original logic.
• Connecting, yet keeping separate, the definition and the implementation.
  - Important when phenotype implementations and definitions are often conflated.
• Support for a wide range of definition types.
  - From simple codelists (as seen) to trained classifiers. Implementation units can also
    differ across steps, if needed. Enabled via CWL’s Docker integration (more shortly).
26. Phenoflow: Parsing and CWL generation architecture
Our architecture, Phenoflow, allows us to take non-standard phenotype definitions,
standardise them, and make them available for download in CWL.
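[Figure: Phenoflow architecture. Components: Web Portal/API, Generator, Visualiser,
Implementation Units, VC server. Actors: Author(s) (author, expand; data) and User (customise;
receives workflow, visualisation and implementation units). Other edge labels: workflow,
workflow, visualisation.]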
Chapman, Martin, et al. “Phenoflow: A microservice architecture for portable workflow-based
phenotype definitions.” AMIA, 2021. https://kclhi.org/phenoflow
28. Parsing
When an author submits data relating to a definition (e.g. a codelist) via the API (or we pull
this data from existing libraries):
1. Key information that forms the logic of the definition (e.g. a ‘conceptid’ column in a
codelist CSV) is identified.
2. This information is used to automatically determine the number of steps and their content
(e.g. by grouping codes according to coding scheme).
3. Implementation units for each step are automatically created.
4. Implementation units are added to the Phenoflow library ready to be used within a workflow.
5. Subsequent definition edits are tracked using a Data Provenance Template server.
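Steps 1 and 2 can be sketched as follows for the codelist case. The 'conceptid' column comes from the example above; the 'scheme' column, the step dictionary layout and the step ids are hypothetical, and the real parser handles many more input shapes.

```python
import csv
import io
from collections import defaultdict

def steps_from_codelist(file_in):
    """Group codes by coding scheme and emit one logic step per scheme."""
    codes_by_scheme = defaultdict(list)
    for row in csv.DictReader(file_in):
        codes_by_scheme[row["scheme"]].append(row["conceptid"])
    # A connector step first, one logic step per scheme, then an output step.
    steps = [{"number": 1, "id": "load", "type": "load"}]
    for scheme in sorted(codes_by_scheme):
        steps.append({"number": len(steps) + 1, "id": scheme.lower(),
                      "type": "logic", "codes": codes_by_scheme[scheme]})
    steps.append({"number": len(steps) + 1, "id": "output", "type": "output"})
    return steps

codelist = io.StringIO(
    "conceptid,scheme\nICD-C19,ICD10\nSNOMED-C19,SNOMED\nICD-C19b,ICD10\n"
)
steps = steps_from_codelist(codelist)
for step in steps:
    print(step["number"], step["id"], step["type"])
```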
29. Parsing: Implementation unit creation (Step 3) i
To support the automatic creation of implementation units, we developed templates with
placeholder values that are then populated as a part of the parsing process.
• Our simplest template substitutes an array of codes, each of which can then be identified
within an EHR:
    codes = [[LIST]]
    ...
    with open(sys.argv[1], 'r') as file_in, \
         open('[PHENOTYPE]-potential-cases.csv', 'w', newline='') as file_out:
        ...
• Templates for more complex definition types are based upon existing phenotype
implementations (e.g. Python NLP phenotyping at KCL GSTT, clustering techniques).
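Populating such a template amounts to string substitution. The sketch below is a guess at the mechanism, using a shortened stand-in for the template; the populate function and the way [LIST] is rendered are assumptions rather than Phenoflow's actual code.

```python
# Shortened stand-in for the codelist template; the matching logic is elided.
TEMPLATE = (
    "import sys\n"
    "codes = [[LIST]]\n"
    "with open(sys.argv[1], 'r') as file_in, "
    "open('[PHENOTYPE]-potential-cases.csv', 'w', newline='') as file_out:\n"
    "    pass  # matching logic elided\n"
)

def populate(template, codes, phenotype):
    """Substitute the [LIST] and [PHENOTYPE] placeholders."""
    code_list = ", ".join(repr(code) for code in codes)
    return template.replace("[LIST]", code_list).replace("[PHENOTYPE]", phenotype)

unit = populate(TEMPLATE, ["ICD-C19", "SNOMED-C19"], "covid19")
compile(unit, "icd10.py", "exec")  # raises SyntaxError if the result is not valid Python
print(unit)
```

Compiling the populated text is a cheap sanity check that the substitution produced syntactically valid Python before the unit is stored.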
30. Parsing: Implementation unit creation (Step 3) ii
Each populated template will eventually be executed using a CommandLineTool in CWL.
• Support for different types of definitions (from simple codelists to trained classifiers) is
provided by creating custom Docker images, which provide specific language and package
support. These are then later used by each tool to execute the implementation unit.
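To make the link between language, Docker image and tool concrete, here is a hedged sketch that builds a CommandLineTool description as a plain dictionary. The RUNTIMES mapping is an assumption; only the python image name appears in this talk, and the javascript entry is a hypothetical placeholder.

```python
import json

# Assumed mapping from implementation-unit language to (base command, Docker image).
# The javascript image name is hypothetical.
RUNTIMES = {
    "python": ("python", "kclhi/python:latest"),
    "javascript": ("node", "example/node:latest"),
}

def command_line_tool(step_id, doc, language):
    """Describe a CWL CommandLineTool that runs one implementation unit in Docker."""
    base_command, image = RUNTIMES[language]
    return {
        "cwlVersion": "v1.0",
        "class": "CommandLineTool",
        "id": step_id,
        "doc": doc,
        "baseCommand": base_command,
        "requirements": {"DockerRequirement": {"dockerPull": image}},
        "inputs": [
            {"id": "inputModule", "type": "File", "inputBinding": {"position": 1}},
            {"id": "potentialCases", "type": "File", "inputBinding": {"position": 2}},
        ],
        "outputs": [
            {"id": "output", "type": "File", "outputBinding": {"glob": "*.csv"}},
        ],
    }

tool = command_line_tool("icd10", "Identify COVID-19 (ICD-10)", "python")
print(json.dumps(tool, indent=2))  # CWL documents may be written as JSON or YAML
```

Swapping the language argument changes only the base command and the image, which is the sense in which the Docker images carry the language and package support.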
31. Parsing: Provenance (Step 5)
Our architecture includes a Data Provenance Template server, a piece of software that holds
structured fragments of provenance.
These fragments record the evolution of definitions within Phenoflow, as they are edited by
users.
Designed to complement CWLProv, which records workflow execution.
[Figure: provenance template for a definition update.]
Relations: used, used, wasAssociatedWith, wasGeneratedBy
var:updated
  prov:end vvar:time
  prov:type phenoflow#Updated
  zone:id update
var:author
  prov:type phenoflow#Author
var:phenotypeAfter
  phenoflow:description vvar:description
  phenoflow:name vvar:name
  prov:type phenoflow#Phenotype
var:phenotypeBefore
  prov:type phenoflow#Phenotype
var:step
  phenoflow:coding vvar:coding
  phenoflow:doc vvar:doc
  phenoflow:position vvar:position
  phenoflow:stepName vvar:stepName
  phenoflow:type vvar:type
  prov:type phenoflow#Step
  zone:id update
Fairweather, Elliot, et al. “A delayed instantiation approach to template-driven provenance for
electronic health record phenotyping”. IPAW, 2020.
33. Phenoflow library
At the end of the parsing process, we effectively have a set of database entries (and a set of
implementation units), containing the information required to generate a workflow. This forms
the library.
34. Phenoflow library: Additional implementation units
After an initial import, other users can add further implementation units, which makes it
possible to customise the workflow that is downloaded.
Once a permutation is selected, a user can generate a CWL workflow of that permutation on
the fly.
36. Generation
When a user clicks download:
1. Pass information related to the chosen workflow permutation to the generator, and receive
CWL files back in response.
2. Create a version of this workflow in a local Git server.
3. Pass a link to this versioned code to the visualiser and receive a graphic back.
4. Combine the CWL files, implementation units and visualised workflow (to increase
intelligibility) into a zip and push to the user for download.
5. The user sets configuration details within the implementation units (e.g. database
credentials) and then locally executes the workflow against their target datasource.
37. Generation: On-the-fly workflow generation (Step 1)
We created a lightweight service wrapper around the CWL generator so that it can be called in
real time, generating a workflow based on the information stored as a part of the parsing
process.
    @app.route('/generate', methods=['POST'])
    async def generate(request):
        try:
            steps = await request.json()
        except Exception:
            # Missing or malformed JSON body
            steps = None
        if steps:
            generatedWorkflow = generateWorkflow(steps)
            return JSONResponse({
                'workflow': yaml.dump(generatedWorkflow['workflow'] ... )
            })
        ...
    ...
Source: https://github.com/kclhi/phenoflow/tree/master/generator
39. Impact
The use of CWL in this way has already had some impact:
1. We are connected to the HDRUK phenotype library
(https://phenotypes.healthdatagateway.org/), and automatically provide
implementations for their 1000+ phenotype definitions.
2. We are actively working with, or in conversation with, several sites in the US to
represent their definitions.
3. Phenoflow has been used to represent some recent complex phenotypes, e.g. Long
Covid (Mayor, Nikhil, et al. “Developing a Long COVID Phenotype for Postacute
COVID-19 in a National Primary Care Sentinel Cohort: Observational Retrospective
Database Analysis”. JMIR, 2022.).
More to do! We are always looking for new phenotype definitions to increase the
sophistication of our parsing process.
40. Things we could probably do better i
1. Generation overhead
We initially believed that the style of ‘on the fly’ generation used in Phenoflow was required
because of all the possible permutations of implementation units that could be selected.
In reality, we have determined that the overhead of generating the corresponding CWL for these
permutations in advance is smaller than the delay that on-the-fly generation imposes on a user.
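The number of workflows to pre-generate is the product of the number of implementation units available at each step. A minimal sketch, assuming a hypothetical library entry with two alternatives for each logic step:

```python
from itertools import product

# Hypothetical library entry: the implementation units available for each step.
units_by_step = {
    "load": ["load.py"],
    "icd10": ["icd10.py", "icd10.js"],
    "snomed": ["snomed.py", "snomed.js"],
    "output": ["output.py"],
}

def permutations(units):
    """Yield every way of choosing one implementation unit per step."""
    step_ids = list(units)
    for choice in product(*(units[step_id] for step_id in step_ids)):
        yield dict(zip(step_ids, choice))

all_workflows = list(permutations(units_by_step))
print(len(all_workflows))  # 1 * 2 * 2 * 1 = 4
```

Because the count grows multiplicatively with each added unit, whether pre-generation stays cheap depends on how many alternatives each step accumulates.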
41. Things we could probably do better ii
As such, we are now shifting our architecture to instead use GitHub as a store for
pre-generated workflows produced as a part of the parsing (or editing) process.
[Figure: revised architecture. Components: API, Generator, Visualiser, GitHub. Actors:
Author(s) (author, expand; data) and User (query; receives a link to the workflow,
implementation units and visualisation). Other edge labels: workflow, index, workflows.]
We hope to progress a fork of CWL Viewer that effectively acts as the web portal by
visualising (and indexing) the available Git repositories.
42. Things we could probably do better iii
2. Generator version
We are using the original CWL generator (python-cwlgen), but should now, instead, be using
cwl-utils.
43. Things we could probably do better iv
3. Branch handling
As a part of our parsing process, we ‘flatten’ branches into individual steps if they are simple,
and into entire nested workflows if they are more complex.
Each branch evaluates to a boolean value, representing whether the logic it contains suggests
that a patient has the condition. Then, much like the simpler examples we’ve seen, if any of
the steps return true, the patient is deemed to have the condition.
There may well be a more sophisticated way to do this in CWL.
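The flattening described here can be pictured as a small recursive evaluation. In this sketch a simple branch is a single criterion function and a complex branch is a (combine, children) pair standing in for a nested workflow; that representation is an assumption for illustration, not how the CWL is structured.

```python
def evaluate_branch(branch, patient):
    """Reduce a simple criterion or a nested workflow to a single boolean."""
    if callable(branch):
        return bool(branch(patient))
    # A nested workflow carries its own combining rule (e.g. all or any).
    combine, children = branch
    return combine(evaluate_branch(child, patient) for child in children)

def is_case(branches, patient):
    """The patient is deemed to have the condition if any branch is true."""
    return any(evaluate_branch(branch, patient) for branch in branches)

def has_code(patient):
    return "ICD-C19" in patient["codes"]

def positive_test(patient):
    return patient.get("test") == "positive"

def has_symptoms(patient):
    return bool(patient.get("symptoms"))

# One simple branch, plus one nested branch requiring a test result and symptoms.
branches = [has_code, (all, [positive_test, has_symptoms])]

print(is_case(branches, {"codes": set(), "test": "positive", "symptoms": ["cough"]}))
print(is_case(branches, {"codes": set(), "test": "positive", "symptoms": []}))
```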
44. Things we could probably do better v
4. The CWL itself!
$namespaces:
  s: http://phenomics.kcl.ac.uk/phenoflow/
baseCommand: python
class: CommandLineTool
cwlVersion: v1.0
doc: Identify COVID-19 (ICD-10)
id: icd10
inputs:
- doc: Python implementation unit
  id: inputModule
  inputBinding:
    position: 1
  type: File
- doc: Potential cases of covid-19.
  id: potentialCases
  inputBinding:
    position: 2
  type: File
outputs:
- doc: Patients with ICD-10 COVID-19 codes
  id: output
  outputBinding:
    glob: '*.csv'
  type: File
requirements:
  DockerRequirement:
    dockerPull: kclhi/python:latest
s:type: logic
45. Summary
• We standardise existing phenotype definitions under a CWL-based model.
• These standardised definitions are presented to users as a part of the Phenoflow library.
• CWL files themselves are generated in real time when a user downloads a given definition
from the library.
Thank you! Things like CWL’s Docker integration and the generation and visualisation tools
have been invaluable.
46. Links
Links given throughout the presentation:
Live: https://kclhi.org/phenoflow
Source: https://github.com/kclhi/phenoflow
Wiki: https://github.com/kclhi/phenoflow/wiki