Research data management for medical data with pyradigm.
pyradigm is a Python data structure for biomedical data that manages multiple tables linked via patient info or other hashable IDs. By allowing continuous validation, this data structure improves both the ease of use and the integrity of the dataset.
7. Dataset Lifecycle in ML
[Diagram: input or raw data on the disk (folder hierarchy, metadata etc.) flows through intermediate stages into various types of outputs, in widely varying formats, reflecting the diversity of needs and the diversity of users.]
9. Challenges in RDM for Medical Data
• Too many tables to manage, even for a small project!
• Mixed data types
  • Primary & secondary features
  • Several attributes of mixed types
• Multiple tables need to be linked and integrated with unique IDs
• Provenance needs to be captured
  • Processing steps
  • Their metadata
• Ad hoc scripts to read and manage CSVs do not work at all
• Frequent change of hands
  • Students, RAs, staff etc., often with limited training
• Recipe for disaster
  • Features can get mixed up easily
  • Having to dig through multiple scripts and Word documents to figure out where everything is, what it means, and whether it is all properly linked is a nightmare!
• The library was built to reduce my own pain!
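The ID-linking problem above can be made concrete with a small sketch. This is purely illustrative (not pyradigm code): the table names and fields are made up, and the joining is done with plain dicts keyed by a hashable patient ID.

```python
# Hypothetical per-patient tables; in practice these come from separate CSVs.
demographics = {"sub01": {"age": 64, "sex": "F"}, "sub02": {"age": 71, "sex": "M"}}
imaging = {"sub01": {"hippocampus_vol": 3.1}, "sub02": {"hippocampus_vol": 2.7}}
diagnosis = {"sub01": "healthy", "sub02": "disease"}

def link_tables(*tables):
    """Inner-join dict-based tables on the hashable IDs they share."""
    common_ids = set(tables[0])
    for t in tables[1:]:
        common_ids &= set(t)  # keep only IDs present in every table
    merged = {}
    for pid in sorted(common_ids):
        row = {}
        for t in tables:
            val = t[pid]
            # scalar tables (e.g. diagnosis) become a single "target" field
            row.update(val if isinstance(val, dict) else {"target": val})
        merged[pid] = row
    return merged

linked = link_tables(demographics, imaging, diagnosis)
```

Joining on the intersection of IDs means a subject missing from any one table is silently dropped, which is exactly the kind of step worth validating rather than leaving to ad hoc scripts.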
19. Need for accessibility and domain adaptation
• Existing libraries for data tables, e.g. pandas, add a big barrier
  • Cognitive burden
  • Too contrived
  • Terminology misuse, mistaken use
• Domain adaptation
  • There are always a few domain-specific minor issues to handle
  • Preprocessing, naming, validation etc.
• Diversity of data types
  • Hashable ID: integers, alphanumeric etc.
  • Features: a simple vector of numbers, or more structured data like graphs or trees
  • Targets: integers, categorical (healthy vs. disease), multi-output (disease 1 AND disease 2 etc.)
• Trying to reduce “data entropy” in key parts of RDM
• Reminder: data lasts MUCH LONGER than the project itself!
22. Core structure of pyradigm
• Common predictive modeling, machine learning and biomarker workflows need to deal with
  • multiple types of data: features, targets, confounds
  • multiple data types: numerical, categorical; scalar, vector; potentially differing in length
• This is challenging and error-prone, especially when multiple experiments and comparisons are performed, e.g. across different sub-groups, different targets, different covariate regressions etc.
• pyradigm links these diverse types of data for the same ID / hash / subject
[Diagram: a feature matrix X (N samplets × p features); a target array y (k targets: continuous targets such as score, severity or age imply regression; categorical targets such as healthy vs. disease imply classification); and an attribute table A (m attributes: covariates or confounds such as age, gender, site; usually scalars, but sometimes vectors too), all linked by samplet ID.]
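The X / y / A structure above can be sketched as one record per samplet, keyed by a hashable ID. This is a minimal illustration of the idea, not pyradigm's actual internals; the field names are assumptions.

```python
# One record per samplet, keyed by a hashable ID, keeping features (X),
# target (y) and attributes (A) together so they can never get misaligned.
dataset = {}

def add_samplet(samplet_id, features, target, attributes=None):
    """Store features, target and attributes under one hashable ID."""
    if samplet_id in dataset:
        raise KeyError(f"duplicate samplet ID: {samplet_id}")
    dataset[samplet_id] = {
        "features": list(features),            # X: p features
        "target": target,                      # y: label or continuous value
        "attributes": dict(attributes or {}),  # A: covariates/confounds, e.g. age, site
    }

add_samplet("sub01", [0.3, 1.2, 5.0], "healthy", {"age": 64, "site": "A"})
add_samplet("sub02", [0.1, 0.9, 4.1], "disease", {"age": 71, "site": "B"})
```

Because everything for a samplet lives under one key, a sub-group selection or covariate regression always sees consistent rows of X, y and A.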
30. Implementation details
• BaseDataset
  • Abstract base class, defining the coarse structure and properties
  • A collection of hashable IDs (dict keys), each ID expecting data and a target, plus an optional set of attributes
  • Different types of validation
• Methods
  • Add, summarize, retrieve, delete
  • Arithmetic: combine, transform etc.
  • Sampling: by target values, by attribute properties, or randomly
  • Exporting to different formats
• A few derived classes, with specific conditions on target properties, such as whether the target is categorical or numerical
  • ClassificationDataset: target is often a string (healthy, disease) or an integer (-1, 1, 2)
  • RegressionDataset: target is a continuous float value, e.g. a disease severity score, age etc.
  • Many other possibilities, depending on the domain and use-case
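The class layout described above can be sketched with an abstract base class whose subclasses differ only in how they validate the target. The class names mirror the slides, but this is a hedged sketch: pyradigm's real classes have a much richer interface.

```python
from abc import ABC, abstractmethod

class BaseDataset(ABC):
    """Coarse structure only: a dict of dicts keyed by hashable ID."""

    def __init__(self):
        self._data = {}  # ID -> {"features": ..., "target": ...}

    def add_samplet(self, samplet_id, features, target):
        self._validate_target(target)  # derived classes define what is valid
        self._data[samplet_id] = {"features": features, "target": target}

    @abstractmethod
    def _validate_target(self, target):
        """Each derived class imposes its own condition on the target."""

class ClassificationDataset(BaseDataset):
    def _validate_target(self, target):
        if not isinstance(target, (str, int)):
            raise TypeError("classification target must be a label (str or int)")

class RegressionDataset(BaseDataset):
    def _validate_target(self, target):
        if not isinstance(target, float):
            raise TypeError("regression target must be a continuous float")

clf = ClassificationDataset()
clf.add_samplet("sub01", [0.3, 1.2], "healthy")  # accepted: categorical label

reg = RegressionDataset()
reg.add_samplet("sub01", [0.3, 1.2], 27.5)       # accepted: severity score
```

Putting the target check behind an abstract method is what lets "many other possibilities" be added per domain without touching the base class.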
36. Implementation details contd.
• Classes and data are managed via a dict of dicts
  • Convenient for developers
• The current setup is fine for our domain: 1000s of rows, a few 1000s of columns
  • Larger and more complex domains would need fine-tuning
• Serialization
  • Pickle files
  • HDF5 etc. are possible; requesting help from contributors
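The pickle-based serialization mentioned above amounts to a simple save/load round trip; an HDF5 backend would follow the same pattern. A minimal sketch with a plain dict standing in for the dataset object:

```python
import os
import pickle
import tempfile

# A dict of dicts standing in for the dataset being serialized.
dataset = {"sub01": {"features": [0.3, 1.2], "target": "healthy"}}

path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")

# Save the whole structure in one shot...
with open(path, "wb") as f:
    pickle.dump(dataset, f)

# ...and restore it, IDs, features and targets intact.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Pickle's usual caveats apply (Python-only, not safe to load from untrusted sources), which is one reason formats like HDF5 are worth contributing.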
38. Usage
• Once the dataset is built by someone, no one else has to worry about it
• They can slice and dice it in any number of ways they desire
• You don’t need the script that built this data structure, as it is more or less self-explanatory
40. Dataset iteration & arithmetic
• A lot more intuitive!
• Higher-level organization, not rows, columns and comments!
• Meta-data gets propagated automatically, so life is much more productive!
• Achieving this with CSVs is a huge pain!
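The kind of dataset arithmetic meant here can be sketched as combining two ID-keyed datasets while the per-samplet metadata rides along automatically. This is an illustration of the pattern, not pyradigm's API; the field names are assumptions.

```python
# Two datasets for the same subjects, e.g. features from two visits.
visit1 = {"sub01": {"features": [0.3], "meta": {"scanner": "A"}},
          "sub02": {"features": [0.1], "meta": {"scanner": "B"}}}
visit2 = {"sub01": {"features": [1.2]},
          "sub02": {"features": [0.9]}}

def combine(ds_a, ds_b):
    """Concatenate features ID-by-ID; metadata from ds_a is propagated."""
    out = {}
    for sid in ds_a.keys() & ds_b.keys():  # only IDs present in both
        out[sid] = {"features": ds_a[sid]["features"] + ds_b[sid]["features"],
                    "meta": dict(ds_a[sid].get("meta", {}))}
    return out

combined = combine(visit1, visit2)

# Iteration is over samplets, not over CSV rows and comment lines.
ids_seen = sorted(combined)
```

With CSVs, the same operation means matching row orders by hand and copying metadata columns manually, which is exactly the pain the slide refers to.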
43. Advantages
• Intuitive for niche domains: easy to use and teach
• Continuous validation
  • As part of .add_samplet() or .add_attr() etc.
  • Infinite, invalid or unexpected values
  • Duplicate rows, columns of all 0s
  • Allows arbitrary user-defined or domain-specific checks!
• Errors are caught early, instead of much later, e.g. while using some other toolbox and then having to painfully trace them back
• Improves integrity
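The continuous-validation idea above can be sketched as checks that run at insertion time, so a bad value is rejected the moment it enters the dataset. This is a hedged sketch of the pattern; the actual checks inside pyradigm's .add_samplet() may differ.

```python
import math

dataset = {}

def add_samplet(samplet_id, features, target, extra_checks=()):
    """Validate on insert: bad data never makes it into the dataset."""
    if samplet_id in dataset:
        raise KeyError(f"duplicate samplet ID: {samplet_id}")
    if any(not math.isfinite(v) for v in features):
        raise ValueError("features contain NaN or infinite values")
    if all(v == 0 for v in features):
        raise ValueError("feature vector is all zeros")
    for check in extra_checks:  # arbitrary user-defined / domain-specific checks
        check(features, target)
    dataset[samplet_id] = {"features": features, "target": target}

add_samplet("sub01", [0.3, 1.2], "healthy")  # passes all checks

try:
    add_samplet("sub02", [float("nan"), 1.0], "disease")
except ValueError as err:
    caught = str(err)  # rejected immediately, at the point of entry
```

Catching the NaN here, rather than deep inside a downstream toolbox, is what makes tracing the error trivial instead of a nightmare.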