Research data management for
medical data with pyradigm
Pradeep Reddy Raamana
crossinvalidation.com github.com/raamana
Research data management
Lifecycle stages: Plan, Create, Process, Analyze, Preserve, Share, Reuse
Goal: reduce data entropy in a few parts of this lifecycle!
Data entropy: normal degradation in information content associated with data and metadata over time (© Data One)
I focus not on files, but on derived features, i.e. tables for machine learning: Research Feature Management (RFM)?
Dataset Lifecycle in ML
[Diagram stages: input or raw data (on the disk, folder hierarchy, metadata etc), Intermediate 1, Intermediate 2, Outputs 1, Outputs 2, Outputs 3]
• Various types
• Widely varying formats
• Diversity of needs
• Diversity of users
Challenges in RDM for Medical Data
Too many tables to manage
even for a small project!
• Mixed data types
  • Primary & secondary features
  • Several attributes of mixed types
• Multiple tables need to be linked and integrated with unique IDs
• Provenance needs to be captured
  • Processing steps
  • Their metadata
• Ad hoc scripts to read and manage CSVs do not work at all
• Frequent change of hands
  • Students, RAs, staff etc
  • With limited training
• Recipe for disaster
  • Features can get mixed up easily
  • Having to look into multiple scripts and Word documents to figure out where everything is, what it means, and whether it is all properly linked is a nightmare!
• Library built to reduce my own pain!
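To make the "multiple tables linked with unique IDs" problem concrete, here is a minimal sketch in plain Python. It is not pyradigm code: the function `link_tables`, the table names and all values are made up for illustration, and plain dicts stand in for separately maintained CSV tables.

```python
# Toy illustration only: plain dicts stand in for separate CSV tables.
# Every name and value below is hypothetical.

def link_tables(tables):
    """Merge several {subject_id: {column: value}} tables on their IDs.

    Returns the merged records plus the IDs missing from at least one table.
    """
    all_ids = set()
    for table in tables.values():
        all_ids.update(table)  # iterating a dict yields its keys (the IDs)

    merged, incomplete = {}, set()
    for sid in sorted(all_ids):
        record = {}
        for name, table in tables.items():
            if sid in table:
                record[name] = table[sid]
            else:
                incomplete.add(sid)  # this ID is absent from one table
        merged[sid] = record
    return merged, incomplete


tables = {
    "features": {"sub01": {"thickness": 2.5}, "sub02": {"thickness": 2.1}},
    "diagnosis": {"sub01": {"label": "healthy"}, "sub02": {"label": "disease"}},
    "demographics": {"sub01": {"age": 64}},  # sub02 is missing here
}
merged, incomplete = link_tables(tables)
print(sorted(incomplete))  # ['sub02']
```

Even in this tiny case the bookkeeping for missing or mismatched IDs has to be written by hand, which is the kind of effort that tends to get repeated in every ad hoc script.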
Need for accessibility and domain adaptation
• Existing libraries for data tables, e.g. pandas, add a big barrier
  • Cognitive burden
  • Too contrived
  • Terminology misuse, mistaken use
• Domain adaptation
  • There are always a few minor domain-specific issues we need to handle
  • Preprocessing, naming, validation etc
• Diversity of data types
  • Hashable ID: integers, alphanumeric etc
  • Features: simple vector of numbers, or more structured data like graphs, trees
  • Targets: integers, categorical (healthy vs. disease), multi-output (disease 1 AND disease 2 etc)
• Trying to reduce “data entropy” in key parts of RDM
• Reminder: data lasts MUCH LONGER than the project itself!
Core structure of pyradigm
• Common predictive modeling, machine learning and biomarker workflows need to deal with
  • multiple types of data
    • features
    • targets
    • confounds
  • multiple data types
    • numerical, categorical
    • scalar, vector
    • potentially differing in length
• This is challenging and error-prone!
  • esp. when multiple experiments and comparisons are performed, e.g. across different sub-groups, different targets, different covariate regressions etc
• Link between diverse types of data for the same ID / hash / subject
[Figure: feature matrix X (N samplets × p features), targets y (k targets), attributes A (m attributes)]
• Targets y
  • continuous (score, severity, age etc) → regression
  • categorical (healthy vs disease) → classification
• Attributes A
  • covariates or confounds such as age, gender, site
  • usually scalars, but sometimes vectors too!
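The link between features, a target and attributes for one ID can be sketched with a small container keyed by that ID. This is an illustration in plain Python, not pyradigm's implementation: the `Samplet` dataclass, its field names and the example values are assumptions made here.

```python
from dataclasses import dataclass, field


@dataclass
class Samplet:
    """Hypothetical record: everything known about one ID, kept together."""
    features: list                              # one row of X
    target: object                              # one entry of y
    attrs: dict = field(default_factory=dict)   # one row of A


dataset = {}  # hashable ID -> Samplet
dataset["sub01"] = Samplet([2.5, 1.1, 0.3], "healthy", {"age": 64, "site": "A"})
dataset["sub02"] = Samplet([2.1, 0.9, 0.7], "disease", {"age": 71, "site": "B"})

# X, y and A are always derived in the same ID order, so rows cannot drift apart.
ids = sorted(dataset)
X = [dataset[i].features for i in ids]
y = [dataset[i].target for i in ids]
ages = [dataset[i].attrs.get("age") for i in ids]
```

Because the three pieces travel together under one key, picking a different target or regressing out a different covariate does not require re-aligning separate tables.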
Implementation details
• BaseDataset
  • Abstract base class, defining the coarse structure and properties
  • A collection of hashable IDs (dict keys)
    • each ID expecting data and a target
    • and an optional set of attributes
  • Validation: different types of
  • Methods
    • Add, summarize, retrieve, delete
    • Arithmetic: combine, transform etc
    • Sampling: by target values, by attribute properties, or randomly
    • Exporting to different formats
• Few derived classes
  • Specific conditions on target properties, such as whether it is categorical or numerical
  • ClassificationDataset
    • Target: often a string (healthy, disease), or an integer (-1, 1, 2)
  • RegressionDataset
    • Target: continuous float value: disease severity score, age etc
  • Many other possibilities
    • Depending on domain and use-case
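The pattern of an abstract base class plus derived classes that only differ in their target checks can be sketched as follows. The `Toy...` classes are hypothetical stand-ins written for this example; they are deliberately not named after, and do not reproduce, pyradigm's own classes or signatures.

```python
from abc import ABC, abstractmethod
from numbers import Real


class ToyBaseDataset(ABC):
    """Hypothetical stand-in for an abstract dataset; not pyradigm's class."""

    def __init__(self):
        self._data = {}      # samplet ID -> feature vector
        self._targets = {}   # samplet ID -> target
        self._attrs = {}     # samplet ID -> {attribute name: value}

    @abstractmethod
    def _check_target(self, target):
        """Raise if the target does not suit this kind of dataset."""

    def add(self, samplet_id, features, target, **attrs):
        hash(samplet_id)  # raises TypeError for unhashable IDs
        if samplet_id in self._data:
            raise ValueError(f"{samplet_id!r} already exists")
        self._check_target(target)
        self._data[samplet_id] = list(features)
        self._targets[samplet_id] = target
        self._attrs[samplet_id] = dict(attrs)

    def delete(self, samplet_id):
        for store in (self._data, self._targets, self._attrs):
            store.pop(samplet_id)

    def summarize(self):
        return {"num_samplets": len(self._data),
                "targets": sorted(set(map(str, self._targets.values())))}


class ToyClassificationDataset(ToyBaseDataset):
    def _check_target(self, target):
        if isinstance(target, bool) or not isinstance(target, (str, int)):
            raise TypeError("class label must be a string or an integer")


class ToyRegressionDataset(ToyBaseDataset):
    def _check_target(self, target):
        if isinstance(target, bool) or not isinstance(target, Real):
            raise TypeError("regression target must be a real number")


clf = ToyClassificationDataset()
clf.add("sub01", [2.5, 1.1], "healthy", age=64)
clf.add("sub02", [2.1, 0.9], "disease", age=71)
summary = clf.summarize()
```

Keeping the target rule in a single overridable method means a new kind of dataset only has to state what a valid target looks like; adding, deleting and summarizing are inherited.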
Implementation details contd.
• Classes and data are managed via dict of dicts
  • Convenience for developers
• Current setup is fine for our domain:
  • 1000s of rows, few 1000s of columns
  • Larger and more complex domains need fine-tuning
• Serialization
  • Pickle files
  • HDF etc are possible
  • Requesting help from contributors
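A dict-of-dicts layout and a pickle round trip can be sketched with the standard library alone. The key names `data`, `targets` and `attrs` and the file name are assumptions for this example, not pyradigm's actual on-disk format.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical dict-of-dicts layout: one inner dict per kind of information,
# all keyed by the same samplet IDs.
store = {
    "data": {"sub01": [2.5, 1.1], "sub02": [2.1, 0.9]},
    "targets": {"sub01": "healthy", "sub02": "disease"},
    "attrs": {"sub01": {"age": 64}, "sub02": {"age": 71}},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_dataset.pkl"
    with open(path, "wb") as f:
        pickle.dump(store, f)       # serialize the whole structure
    with open(path, "rb") as f:
        restored = pickle.load(f)   # read it back as an independent copy
```

Pickle is convenient for plain Python objects, but files should only be loaded from trusted sources, which is one reason formats such as HDF are worth considering for sharing.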
Usage
• Once it is built by someone, no one else has to worry about it.
• They can slice and dice it in any number of ways they desire
• You don’t need the script that built this data structure, as it’s more or less self-explanatory
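As a rough idea of what "slice and dice" can mean once everything is keyed by ID, here is a sketch in plain Python. The helper names `by_target`, `by_attr` and `random_subset` and the example records are hypothetical; they are not pyradigm methods.

```python
import random

# Hypothetical records: each ID carries its features, target and attributes.
samplets = {
    "sub01": {"features": [2.5, 1.1], "target": "healthy", "attrs": {"site": "A", "age": 64}},
    "sub02": {"features": [2.1, 0.9], "target": "disease", "attrs": {"site": "B", "age": 71}},
    "sub03": {"features": [2.4, 1.0], "target": "healthy", "attrs": {"site": "B", "age": 58}},
    "sub04": {"features": [1.9, 0.8], "target": "disease", "attrs": {"site": "A", "age": 77}},
}


def by_target(ds, target):
    """Keep only the samplets with the given target value."""
    return {sid: s for sid, s in ds.items() if s["target"] == target}


def by_attr(ds, name, predicate):
    """Keep only the samplets whose attribute `name` satisfies `predicate`."""
    return {sid: s for sid, s in ds.items()
            if name in s["attrs"] and predicate(s["attrs"][name])}


def random_subset(ds, count, seed=None):
    """Pick `count` samplets at random (reproducible when a seed is given)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(ds), count)
    return {sid: ds[sid] for sid in chosen}


patients = by_target(samplets, "disease")
older_at_site_b = by_attr(by_attr(samplets, "site", lambda v: v == "B"),
                          "age", lambda v: v > 60)
half = random_subset(samplets, 2, seed=0)
```

Each slice is itself a complete mini-dataset, so features, targets and attributes stay linked no matter how the selection is chained.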
Dataset iteration & arithmetic
• A lot more intuitive!
• Higher-level organization, not rows, columns and comments!
• Metadata gets propagated automatically → life is much more productive!
• Achieving this with CSVs is a huge pain!
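Iteration and arithmetic on a dataset object can be sketched like this. `ToyDataset` is a hypothetical class written for this example, and the choice that `+` concatenates the features of matching IDs is an assumption, not a statement about how pyradigm defines its operators.

```python
class ToyDataset:
    """Hypothetical dataset supporting iteration and '+' (feature concatenation)."""

    def __init__(self, description=""):
        self.description = description  # metadata that should travel with the data
        self._features = {}
        self._targets = {}

    def add(self, samplet_id, features, target):
        self._features[samplet_id] = list(features)
        self._targets[samplet_id] = target

    def __len__(self):
        return len(self._features)

    def __iter__(self):
        # Yield one (ID, features, target) triple per samplet.
        for sid in self._features:
            yield sid, self._features[sid], self._targets[sid]

    def __add__(self, other):
        if set(self._features) != set(other._features):
            raise ValueError("both datasets must hold the same samplet IDs")
        combined = ToyDataset(f"{self.description} + {other.description}")
        for sid, feats, target in self:
            if other._targets[sid] != target:
                raise ValueError(f"targets disagree for {sid!r}")
            combined.add(sid, feats + other._features[sid], target)
        return combined


thickness = ToyDataset("cortical thickness")
thickness.add("sub01", [2.5, 2.7], "healthy")
thickness.add("sub02", [2.1, 2.0], "disease")

volume = ToyDataset("subcortical volume")
volume.add("sub01", [1200.0], "healthy")
volume.add("sub02", [1100.0], "disease")

both = thickness + volume
rows = {sid: feats for sid, feats, _ in both}
```

The description of the combined dataset is built from its parents automatically, which is the kind of metadata propagation that is tedious to maintain by hand across CSV files.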
Advantages
• Intuitive for niche domains: easy to use and teach
• Continuous validation
  • As part of .add_samplet() or .add_attr() etc
  • Infinite, invalid or unexpected values
  • Duplicate rows, columns of all 0s
  • Allows arbitrary user-defined or domain-specific checks!
• Errors are caught early!
  • Instead of much later, e.g. when using some other toolbox, and then having to painfully trace them back
• Improves integrity
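Validation at the moment of insertion can be sketched as below. The method name `add_samplet` is borrowed from the slide, but the class `ValidatedStore`, its body and the example check `thickness_in_range` are assumptions made for illustration and do not reproduce pyradigm's checks.

```python
import math


class ValidatedStore:
    """Hypothetical store that validates every samplet as it is added."""

    def __init__(self, extra_checks=()):
        self._features = {}
        self._extra_checks = list(extra_checks)  # user-defined callables

    def add_samplet(self, samplet_id, features):
        features = [float(v) for v in features]
        if samplet_id in self._features:
            raise ValueError(f"duplicate samplet ID: {samplet_id!r}")
        if not all(math.isfinite(v) for v in features):
            raise ValueError(f"{samplet_id!r} has NaN or infinite values")
        if features in self._features.values():
            raise ValueError(f"{samplet_id!r} duplicates an existing row")
        for check in self._extra_checks:
            check(samplet_id, features)  # each check raises on failure
        self._features[samplet_id] = features


def thickness_in_range(samplet_id, features):
    """Example domain-specific check with a made-up plausible range."""
    if any(not 0.0 < v < 10.0 for v in features):
        raise ValueError(f"{samplet_id!r} has implausible thickness values")


store = ValidatedStore(extra_checks=[thickness_in_range])
store.add_samplet("sub01", [2.5, 2.7])

rejected = []
for sid, feats in [("sub02", [float("inf"), 1.0]),   # infinite value
                   ("sub03", [2.5, 2.7]),            # duplicate row
                   ("sub04", [25.0, 2.0])]:          # fails the domain check
    try:
        store.add_samplet(sid, feats)
    except ValueError:
        rejected.append(sid)
```

Column-level checks, such as a feature that is 0 for every samplet, need the full table and would run as a separate pass; they are left out of this sketch.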
Advanced use cases: MultiDataset
[Figure: the same N samplets, with shared targets y (k targets) and attributes A (m attributes), linked to multiple feature sets X1, X2, X3, X4 with p1, p2, p3, p4 features respectively]
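The idea of several feature sets sharing one set of IDs, targets and attributes can be sketched as follows. `ToyMultiDataset`, its methods and the modality names are hypothetical and written for this example; they are not the interface of pyradigm's MultiDataset.

```python
class ToyMultiDataset:
    """Hypothetical container: shared targets/attributes, many feature sets."""

    def __init__(self, targets, attrs=None):
        self.targets = dict(targets)       # samplet ID -> target (y)
        self.attrs = dict(attrs or {})     # samplet ID -> attributes (A)
        self.modalities = {}               # name -> {samplet ID: features}

    def add_modality(self, name, features_by_id):
        # Every feature set must cover exactly the same samplet IDs.
        if set(features_by_id) != set(self.targets):
            raise ValueError(f"{name!r} does not cover the same samplet IDs")
        self.modalities[name] = {sid: list(f) for sid, f in features_by_id.items()}

    def get(self, samplet_id):
        """Return (features per modality, target, attributes) for one ID."""
        feats = {name: table[samplet_id] for name, table in self.modalities.items()}
        return feats, self.targets[samplet_id], self.attrs.get(samplet_id, {})


multi = ToyMultiDataset(
    targets={"sub01": "healthy", "sub02": "disease"},
    attrs={"sub01": {"age": 64}, "sub02": {"age": 71}},
)
multi.add_modality("X1", {"sub01": [2.5, 2.7], "sub02": [2.1, 2.0]})   # p1 = 2
multi.add_modality("X2", {"sub01": [1200.0], "sub02": [1100.0]})       # p2 = 1

feats, target, attrs = multi.get("sub02")
```

Since the ID check runs whenever a feature set is added, comparisons across feature sets are always made on the same samplets with the same targets.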
Thank you
• Check it out here: github.com/raamana
• Follow me @ twitter.com/raamana_
• Contributors most welcome.
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editingBase editing, prime editing, Cas13 & RNA editing and organelle base editing
Base editing, prime editing, Cas13 & RNA editing and organelle base editing
 
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdfPests of safflower_Binomics_Identification_Dr.UPR.pdf
Pests of safflower_Binomics_Identification_Dr.UPR.pdf
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
Bioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptxBioteknologi kelas 10 kumer smapsa .pptx
Bioteknologi kelas 10 kumer smapsa .pptx
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In DubaiDubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
Dubai Calls Girl Lisa O525547819 Lexi Call Girls In Dubai
 
Good agricultural practices 3rd year bpharm. herbal drug technology .pptx
Good agricultural practices 3rd year bpharm. herbal drug technology .pptxGood agricultural practices 3rd year bpharm. herbal drug technology .pptx
Good agricultural practices 3rd year bpharm. herbal drug technology .pptx
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
REVISTA DE BIOLOGIA E CIÊNCIAS DA TERRA ISSN 1519-5228 - Artigo_Bioterra_V24_...
 
Pests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdfPests of Bengal gram_Identification_Dr.UPR.pdf
Pests of Bengal gram_Identification_Dr.UPR.pdf
 
Neurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 trNeurodevelopmental disorders according to the dsm 5 tr
Neurodevelopmental disorders according to the dsm 5 tr
 
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptxRESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
RESPIRATORY ADAPTATIONS TO HYPOXIA IN HUMNAS.pptx
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
User Guide: Pulsar™ Weather Station (Columbia Weather Systems)
 

Medical Data Management with Pyradigm

  • 1. Research data management for medical data with pyradigm. Pradeep Reddy Raamana (crossinvalidation.com, github.com/raamana)
  • 7. Dataset Lifecycle in ML (diagram): input or raw data on the disk (folder hierarchy, metadata etc.); Intermediate 1 and Intermediate 2; Outputs 1, 2 and 3. Annotations: various types, widely varying formats, diversity of needs, diversity of users.
  • 9. Challenges in RDM for Medical Data: too many tables to manage, even for a small project!
  • 17. Challenges in RDM for Medical Data
      • Mixed data types
          • Primary & secondary features
          • Several attributes of mixed types
      • Multiple tables need to be linked and integrated with unique IDs
      • Provenance needs to be captured
          • Processing steps
          • Their metadata
      • Ad hoc scripts to read and manage CSVs do not work at all
      • Frequent change of hands
          • Students, RAs, staff etc.
          • With limited training
      • Recipe for disaster
          • Features can get mixed up easily
          • Having to look into multiple scripts and Word documents to figure out where everything is, what it means, and whether it is all properly linked is a nightmare!
      • Library built to reduce my own pain!
  • 21. Need for accessibility and domain adaptation
      • Existing libraries for data tables, e.g. pandas, add a big barrier
          • Cognitive burden
          • Too contrived
          • Terminology misuse, mistaken use
      • Domain adaptation
          • There are always a few minor domain-specific issues we need to handle
          • Preprocessing, naming, validation etc.
      • Diversity of data types
          • Hashable ID: integers, alphanumeric etc.
          • Features: simple vector of numbers, or more structured data like graphs, trees
          • Targets: integers, categorical (healthy vs. disease), multi-output (disease 1 AND disease 2 etc.)
      • Trying to reduce “data entropy” in key parts of RDM
          • Reminder: data lasts MUCH LONGER than the project itself!
  • 27. Core structure of pyradigm
      • Common predictive modeling, machine learning and biomarker workflows need to deal with
          • multiple types of data: features, targets, confounds
          • multiple data types: numerical, categorical; scalar, vector; potentially differing in length
      • This is challenging and error-prone!
          • esp. when multiple experiments and comparisons are performed, e.g. across different sub-groups, different targets, different covariate regressions etc.
      • Diagram: for each of N samplets, the same ID / hash / subject links diverse types of data
          • X: p features
          • y: k targets. Continuous (score, severity, age etc.) → regression; categorical (healthy vs. disease) → classification
          • A: m attributes. Covariates or confounds such as age, gender, site. Usually scalars, but sometimes vectors too!
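The X / y / A layout on this slide can be pictured with a few lines of plain Python. This is only a minimal sketch of the idea, not pyradigm's actual API: the `Samplet` class, the subject IDs and the attribute names are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Hashable, List


@dataclass
class Samplet:
    """Hypothetical container: everything known about one ID / subject."""
    features: List[float]                                      # one row of X
    target: Any                                                # one entry of y
    attributes: Dict[str, Any] = field(default_factory=dict)   # entries of A


# The dataset is keyed by a hashable ID, so X, y and A cannot drift apart.
dataset: Dict[Hashable, Samplet] = {
    "sub001": Samplet([2.1, 0.4, 7.3], "healthy", {"age": 63, "site": "A"}),
    "sub002": Samplet([1.8, 0.9, 6.5], "disease", {"age": 71, "site": "B"}),
}

# Selecting a sub-group by an attribute keeps features and targets aligned.
site_a = {sid: s for sid, s in dataset.items() if s.attributes["site"] == "A"}
print(sorted(site_a))
print(site_a["sub001"].target)
```

Because every lookup goes through the ID, a sub-group selection never needs a second table or a row index to stay consistent.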
  • 35. Implementation details
      • BaseDataset
          • Abstract base class, defining the coarse structure and properties
          • A collection of hashable IDs (dict keys)
              • each ID expecting data and a target
              • and an optional set of attributes
          • Validation: different types of
      • Methods
          • Add, summarize, retrieve, delete
          • Arithmetic: combine, transform etc.
          • Sampling: by target values, by attribute properties, or randomly
          • Exporting to different formats
      • A few derived classes
          • Specific conditions on target properties, such as whether it is categorical or numerical
          • ClassificationDataset
              • Target: often a string (healthy, disease), or an integer (-1, 1, 2)
          • RegressionDataset
              • Target: continuous float value: disease severity score, age etc.
      • Many other possibilities
          • Depending on domain and use case
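The split between an abstract base and target-specific derived classes might look roughly like the sketch below. The class and method names (`SketchBaseDataset`, `add`, `sample_by_target`, `_check_target`) are invented for this illustration and should not be read as pyradigm's real signatures; only the overall shape follows the slide.

```python
from abc import ABC, abstractmethod
from numbers import Real


class SketchBaseDataset(ABC):
    """Hypothetical stand-in for an abstract dataset keyed by hashable IDs."""

    def __init__(self):
        self._data = {}      # ID -> feature vector
        self._targets = {}   # ID -> target

    @abstractmethod
    def _check_target(self, target):
        """Derived classes decide which targets are acceptable."""

    def add(self, samplet_id, features, target):
        if samplet_id in self._data:
            raise ValueError(f"{samplet_id!r} already exists")
        self._check_target(target)
        self._data[samplet_id] = list(features)
        self._targets[samplet_id] = target

    def sample_by_target(self, target):
        """Return the IDs whose target equals the requested value."""
        return [sid for sid, t in self._targets.items() if t == target]


class SketchClassificationDataset(SketchBaseDataset):
    def _check_target(self, target):
        # bool is excluded explicitly because it is a subclass of int
        if isinstance(target, bool) or not isinstance(target, (str, int)):
            raise TypeError("class label must be a string or an integer")


class SketchRegressionDataset(SketchBaseDataset):
    def _check_target(self, target):
        if isinstance(target, bool) or not isinstance(target, Real):
            raise TypeError("regression target must be a real number")


clf = SketchClassificationDataset()
clf.add("sub001", [2.1, 0.4], "healthy")
clf.add("sub002", [1.8, 0.9], "disease")
clf.add("sub003", [2.0, 0.5], "healthy")
print(clf.sample_by_target("healthy"))
```

Keeping the storage and the generic methods in the base class means a new domain-specific dataset only has to state its own condition on the target.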
  • 36. Implementation details contd.
      • Classes and data are managed via dict of dicts
          • Convenience for developers
          • Current setup is fine for our domain:
              • 1000s of rows, a few 1000s of columns
          • Larger and more complex domains need fine-tuning
      • Serialization
          • Pickle files
          • HDF etc. are possible
          • Requesting help from contributors
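As a rough illustration of the "dict of dicts" idea and of pickle-based serialization, consider the sketch below. The key names (`features`, `targets`, `attributes`) and the subject IDs are assumptions made for the example, and the round trip is done in memory rather than through a file; pyradigm's internal layout may differ.

```python
import pickle

# Hypothetical layout: outer keys name the kind of data, inner keys are IDs.
store = {
    "features": {"sub001": [2.1, 0.4], "sub002": [1.8, 0.9]},
    "targets": {"sub001": "healthy", "sub002": "disease"},
    "attributes": {"sub001": {"age": 63}, "sub002": {"age": 71}},
}

# Every inner dict must cover the same IDs, or the tables are no longer linked.
ids = set(store["features"])
ids_consistent = all(set(inner) == ids for inner in store.values())

# Round trip through pickle, in memory here rather than via a file on disk.
blob = pickle.dumps(store)
restored = pickle.loads(blob)
print(ids_consistent, restored == store)
```

Plain dicts pickle without any custom code, which is part of the developer convenience; the trade-off is that pickle files are Python-specific and should only be loaded from trusted sources.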
  • 38. Usage
      • Once it is built by someone, no one else has to worry about it.
      • They can slice and dice it in any number of ways they desire.
      • You don’t need the script that built this data structure, as it’s more or less self-explanatory.
  • 40. Dataset iteration & arithmetic
      • A lot more intuitive!
      • Higher-level organization, not rows, columns and comments!
      • Metadata gets propagated automatically → life is much more productive!
      • Achieving this with CSVs is a huge pain!
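The code shown on the original slide did not survive in this transcript, so the following is a stand-in sketch rather than a reconstruction. It assumes one plausible meaning of "arithmetic": adding two datasets that describe the same IDs concatenates their features and carries the description along. `SketchDataset` and its methods are hypothetical names.

```python
class SketchDataset:
    """Hypothetical dataset: ID -> (features, target), plus a description."""

    def __init__(self, description=""):
        self.description = description
        self._rows = {}

    def add(self, samplet_id, features, target):
        self._rows[samplet_id] = (list(features), target)

    def __iter__(self):
        # Yield (ID, features, target) so callers never index rows by position.
        for sid, (feats, target) in self._rows.items():
            yield sid, feats, target

    def __add__(self, other):
        # Assumed semantics: join the features of two datasets with equal IDs.
        if set(self._rows) != set(other._rows):
            raise ValueError("datasets must describe the same IDs")
        combined = SketchDataset(f"{self.description} + {other.description}")
        for sid, feats, target in self:
            other_feats, other_target = other._rows[sid]
            if target != other_target:
                raise ValueError(f"target mismatch for {sid!r}")
            combined.add(sid, feats + other_feats, target)
        return combined


thickness = SketchDataset("cortical thickness")
thickness.add("sub001", [2.1, 2.4], "healthy")
thickness.add("sub002", [1.8, 1.9], "disease")

volume = SketchDataset("regional volume")
volume.add("sub002", [6.5], "disease")   # insertion order differs on purpose
volume.add("sub001", [7.3], "healthy")

both = thickness + volume
for sid, feats, target in both:
    print(sid, feats, target)
print(both.description)
```

The join is done by ID, so the differing insertion order of the two inputs does not matter; with two CSV files, the same operation would depend on the rows being sorted identically.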
  • 45. Advantages
      • Intuitive for niche domains: easy to use and teach
      • Continuous validation
          • As part of .add_samplet() or .add_attr() etc.
          • Infinite, invalid or unexpected values
          • Duplicate rows, columns of all 0s
          • Allows arbitrary user-defined or domain-specific checks!
      • Errors are caught early!
          • Instead of much later, e.g. when using some other toolbox and then having to painfully trace it back
      • Improves integrity
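Validation at insertion time can be sketched as follows. This is an illustrative toy, not pyradigm's implementation: `ValidatingDataset`, its `extra_checks` parameter and the `thickness_in_range` check (including its numeric range) are made up for the example.

```python
import math


class ValidatingDataset:
    """Hypothetical sketch: every check runs at insertion time."""

    def __init__(self, extra_checks=()):
        self._rows = {}
        self._extra_checks = list(extra_checks)   # user-defined callables

    def add(self, samplet_id, features):
        features = [float(v) for v in features]
        if not all(math.isfinite(v) for v in features):
            raise ValueError(f"{samplet_id!r}: non-finite feature value")
        if features in self._rows.values():
            raise ValueError(f"{samplet_id!r}: duplicates an existing row")
        for check in self._extra_checks:
            check(samplet_id, features)           # each check raises on failure
        self._rows[samplet_id] = features

    def __len__(self):
        return len(self._rows)


def thickness_in_range(samplet_id, features):
    """Example domain check; the range is illustrative, not a real standard."""
    if not all(0.0 < v < 10.0 for v in features):
        raise ValueError(f"{samplet_id!r}: value outside expected range")


ds = ValidatingDataset(extra_checks=[thickness_in_range])
ds.add("sub001", [2.1, 2.4])

rejected = []
for sid, feats in [("sub002", [float("nan"), 1.9]),   # non-finite value
                   ("sub003", [2.1, 2.4]),            # duplicate of sub001
                   ("sub004", [25.0, 2.0])]:          # fails the domain check
    try:
        ds.add(sid, feats)
    except ValueError:
        rejected.append(sid)
print(rejected, len(ds))
```

Each bad row is reported with its own ID at the moment it is added, which is the "errors are caught early" point: nothing invalid reaches a downstream toolbox.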
  • 52. Advanced use cases: MultiDataset (diagram): N samplets with k targets (y), m attributes (A), and four feature sets X1, X2, X3, X4 having p1, p2, p3, p4 features.
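One way to read the MultiDataset diagram is as several feature sets that all share the same IDs and targets. The sketch below illustrates that reading under stated assumptions: `SketchMultiDataset`, `append` and `subset` are hypothetical names for this example and are not taken from pyradigm.

```python
class SketchMultiDataset:
    """Hypothetical sketch: several feature sets over one shared set of IDs."""

    def __init__(self, targets):
        self._targets = dict(targets)   # ID -> target, shared by all feature sets
        self._modalities = {}           # name -> {ID -> feature vector}

    def append(self, name, features_by_id):
        if set(features_by_id) != set(self._targets):
            raise ValueError(f"{name!r} does not cover the same IDs")
        self._modalities[name] = {k: list(v) for k, v in features_by_id.items()}

    def subset(self, ids):
        """Yield (name, rows, targets) with the same ID order in each set."""
        ids = list(ids)
        targets = [self._targets[i] for i in ids]
        for name, feats in self._modalities.items():
            yield name, [feats[i] for i in ids], targets


multi = SketchMultiDataset({"sub001": "healthy", "sub002": "disease"})
multi.append("X1", {"sub001": [2.1, 2.4], "sub002": [1.8, 1.9]})
multi.append("X2", {"sub002": [6.5], "sub001": [7.3]})

for name, rows, targets in multi.subset(["sub002", "sub001"]):
    print(name, rows, targets)
```

Drawing the same ID subset from every feature set is what makes comparisons across feature sets fair, for example when the same cross-validation split has to be applied to each of them.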
  • 53. Thank you
      • Check it out here: github.com/raamana
      • Follow me @ twitter.com/raamana_
      • Contributors most welcome.