AI, electronic records, health
and why machine learning on dirty data opens new doors
Gaël Varoquaux
1 From clinical studies to electronic health records
G Varoquaux 2
1 Clinical studies are hard
Covid 19 vaccines
Moderna:
- 2 months lab development
- 8 months clinical studies for approval
30 000 volunteers
Pfizer–BioNTech:
- 3 months lab development
- 7 months clinical studies for approval
10 000 volunteers
Experimentation on humans (slow cycles, high risks)
Conclusion across individual heterogeneity
1 Real-life evidence versus clinical trials
Do vaccines prevent spread?
Question ill-suited to intervention & requiring huge samples
Is the AstraZeneca vaccine applicable to people over 65 years old?
Fragile people (elderly) were excluded from the clinical trial
An external validity problem [Colnet... 2020]
Evidence from real-world observational data: following
individuals as they get, or not, the treatment [Dagan... 2021].
1 Electronic Health Records – source of real-life data
Patient records (anything available, really)
Claims databases, accounting, measurement history, doctors’ notes
Great longitudinal coverage
AP-HP (Paris hospitals)
39 hospitals
8 million patients a year
Great population coverage
Free data
1 Electronic Health Records: dirty data challenges
Missing values
Uneven data on patients, across hospital sites
Data not measured because not applicable, or for lack of time in an emergency...
Much larger rate of missingness than in clinical studies (often 80%)
Non-normalized information
Manual input, different conventions
“Diabetes Type 2” — “Diabetes Mellitus, Type 2” — “DM2”
1 Electronic Health Records: observational data ≠ experiments
Treated & non treated patients
are not comparable
Naive conclusions
on treatment efficacy
Causal inference techniques
Settings
- Treatment T (∈ {0, 1})
- Outcome Y
Potential outcome Y (T) (treated or not)
- Covariates X (condition of patient)
Need unconfoundedness
{Y(1), Y(0)} ⊥⊥ T | X
Given the covariates X, the potential outcomes of patients do not
depend on whether they have actually been treated or not
Accounting for covariates to
compensate for differences
1 AI for causal inference [Funk... 2011]
Unconfoundedness: {Y(1), Y(0)} ⊥⊥ T | X
Outcome regression
Model Y (T) = f (X, T)
f can be learned with an “AI”
(statistical machine learning)
Reweighting
Match the covariate distribution
of treated and non treated
Treated Non treated
Learn P(T|X) with an “AI”
More generally:
machine learning as non-parametric statistical estimator
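The two estimators above can be sketched on simulated data. This is a minimal illustration, not the method of [Funk... 2011]: the data-generating process, the effect size of 1.0 and the single binary covariate are all assumptions made for the example, and the “AI” is replaced by plain per-cell averages, which is enough when X takes only two values.

```python
import random

rng = random.Random(0)
n = 20000
true_effect = 1.0  # hypothetical treatment effect used to simulate the data

rows = []
for _ in range(n):
    x = 1 if rng.random() < 0.5 else 0                  # covariate X: severe condition or not
    t = 1 if rng.random() < (0.8 if x else 0.2) else 0  # severe patients are treated more often
    y = true_effect * t - 2.0 * x + rng.gauss(0, 1)     # severity lowers the outcome
    rows.append((x, t, y))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Naive comparison: treated versus non treated, ignoring X
naive = mean(y for x, t, y in rows if t == 1) - mean(y for x, t, y in rows if t == 0)

# Outcome regression: f(X, T) is the mean outcome in each (X, T) cell
f = {(xv, tv): mean(y for x, t, y in rows if (x, t) == (xv, tv))
     for xv in (0, 1) for tv in (0, 1)}
ate_regression = mean(f[(x, 1)] - f[(x, 0)] for x, t, y in rows)

# Reweighting: P(T=1|X) is the treated fraction at each value of X
p_treated = {xv: mean(t for x, t, y in rows if x == xv) for xv in (0, 1)}
ate_reweighting = mean(
    t * y / p_treated[x] - (1 - t) * y / (1 - p_treated[x]) for x, t, y in rows
)

print(f"naive difference:   {naive:.2f}")
print(f"outcome regression: {ate_regression:.2f}")
print(f"reweighting:        {ate_reweighting:.2f}")
```

In this simulation the naive difference is pulled away from the simulated effect because treated patients are more often severe, while both adjusted estimates land close to it. With richer covariates, the per-cell averages would be replaced by learned models of f(X, T) and P(T|X).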
1 Remaining agenda: Machine learning can model this “dirty data”
1 From clinical studies to electronic health records
2 Learning on non-normalized data
3 Learning with missing values
2 Learning on non-normalized data
[Cerda... 2018, Cerda and Varoquaux 2020]
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
Data expressed with categories
in non-standardized form
“Dirty categories”
2 Dirty categories break standard statistical practice
OneHotEncoder not suitable
Overlapping categories
“Master Police Officer”,
“Police Officer III”,
“Police Officer II”...
High cardinality
400 unique entries
in 10 000 rows
Rare categories
Only 1 “Architect III”
New categories in test set
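The problems listed above can be seen with a hand-written one-hot encoder. This is a small sketch in plain Python, not scikit-learn's OneHotEncoder; the helper names `fit_categories` and `one_hot` are made up for the illustration.

```python
def fit_categories(values):
    """Return the sorted list of distinct categories seen in training."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value as a 0/1 vector; unseen values give all zeros."""
    return [1 if value == c else 0 for c in categories]


train = ["Master Police Officer", "Police Officer III", "Bus Operator", "Bus Operator"]
categories = fit_categories(train)

master = one_hot("Master Police Officer", categories)
officer3 = one_hot("Police Officer III", categories)
unseen = one_hot("Police Officer II", categories)  # category absent from the training set

# Overlapping titles share no column: their encodings are orthogonal
overlap = sum(a * b for a, b in zip(master, officer3))

print(categories)
print(master, officer3, "overlap:", overlap)
print("unseen:", unseen)
```

The two police titles end up as unrelated columns, and the unseen title carries no information at all, which is the behaviour the slide calls unsuitable.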
2 Standard approach: data curation
Feature engineering (database normalization)
Employee Position Title   ⇒   Position         Rank
Master Police Officer         Police Officer   Master
Social Worker III             Social Worker    III
Police Officer II             Police Officer   II
Social Worker II              Social Worker    II
Police Officer III            Police Officer   III
Merging entities (deduplication & record linkage)
Output a “clean” database
Company name
Pfizer Inc.
Pfizer Pharmaceuticals LLC
Pfizer International LLC
Pfizer Limited
Pfizer Corporation Hong Kong Limited
Pfizer Pharmaceuticals Korea Limited
...
Difficult without supervision
Hard to make automatic and turn-key
Our view: supervised learning on dirty categories
The statistical question should inform curation
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
2 Simple fix: Adding similarities to one-hot encoding
One-hot encoding
           London   Londres   Paris
Londres    0        1         0
London     1        0         0
Paris      0        0         1
Similarity encoding [Cerda... 2018]
           London   Londres   Paris
Londres    0.3      1.0       0.0
London     1.0      0.3       0.0
Paris      0.0      0.0       1.0
X ∈ ℝ^(n×p)
new categories
link categories
string distance(Londres, London)
= Prototype methods
How to choose a small number of prototypes?
The right prototypes may not be in training set
“big cat” “fat cat”
“big dog” “fat dog”
Estimate prototypes
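Similarity encoding can be sketched with a string similarity based on shared 3-grams. The choice below (Jaccard similarity on space-padded character 3-grams) is an assumption made for this example; [Cerda... 2018] compare several string similarities, and the function names are illustrative.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a string, padded with one space on each side."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def similarity(a, b):
    """Jaccard similarity between the 3-gram sets of two strings."""
    grams_a, grams_b = ngrams(a), ngrams(b)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(value, prototypes):
    """Replace the 0/1 one-hot entries by similarities to each prototype."""
    return [similarity(value, p) for p in prototypes]


prototypes = ["London", "Londres", "Paris"]
for value in ["Londres", "London", "Paris"]:
    print(value, similarity_encode(value, prototypes))
```

With this particular similarity, “London” and “Londres” share 3 of their 10 distinct 3-grams, so the encoding links the two spellings while keeping “Paris” apart. A string never seen in training still gets a meaningful row.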
2 Substring information
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
2 GaP Encoder, a latent category model [Cerda and Varoquaux 2020]
Topic model on sub-strings
(GaP: Gamma-Poisson factorization)
3-grams of a string, e.g. “London”: 3-gram 1 “Lon”, 3-gram 2 “ond”, 3-gram 3 “ndo”, ...
Model strings as a linear combination of substrings
[Figure: the entries “police”, “officer”, “pol off”, “polis”, “policeman”, “policier” are represented as a documents × 3-grams matrix of substring occurrences, which is factorized into a documents × topics matrix (what latent categories are in an entry) and a topics × 3-grams matrix (what substrings are in a latent category).]
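The factorization in the figure can be sketched in plain Python. Note the swap: instead of the Gamma-Poisson model of [Cerda and Varoquaux 2020], this example uses non-negative matrix factorization with the Kullback-Leibler loss (maximum likelihood under a Poisson model, without the Gamma prior), fitted by standard multiplicative updates. The number of topics, iteration count and helper names are assumptions for the illustration.

```python
import math
import random


def char_ngrams(text, n=3):
    """List the character n-grams of a string, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def count_matrix(entries, n=3):
    """Entries x n-grams matrix of n-gram counts, with its vocabulary."""
    vocabulary = sorted({g for e in entries for g in char_ngrams(e, n)})
    column = {g: j for j, g in enumerate(vocabulary)}
    counts = [[0.0] * len(vocabulary) for _ in entries]
    for i, entry in enumerate(entries):
        for g in char_ngrams(entry, n):
            counts[i][column[g]] += 1.0
    return counts, vocabulary


def product(W, H):
    """Matrix product of W (n x k) and H (k x p) as nested lists."""
    return [[sum(w * h for w, h in zip(row, col)) for col in zip(*H)] for row in W]


def kl_divergence(V, R, eps=1e-9):
    """Generalized KL divergence between counts V and reconstruction R."""
    total = 0.0
    for v_row, r_row in zip(V, R):
        for v, r in zip(v_row, r_row):
            if v > 0:
                total += v * math.log(v / (r + eps))
            total += r - v
    return total


def factorize(V, n_topics=2, n_iter=200, seed=0, eps=1e-9):
    """Multiplicative updates decreasing the KL divergence between V and W H."""
    rng = random.Random(seed)
    n, p = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(p)] for _ in range(n_topics)]
    for _ in range(n_iter):
        R = product(W, H)
        ratio = [[V[i][j] / (R[i][j] + eps) for j in range(p)] for i in range(n)]
        for t in range(n_topics):  # update topics (what substrings are in a latent category)
            w_sum = sum(W[i][t] for i in range(n)) + eps
            for j in range(p):
                H[t][j] *= sum(W[i][t] * ratio[i][j] for i in range(n)) / w_sum
        R = product(W, H)
        ratio = [[V[i][j] / (R[i][j] + eps) for j in range(p)] for i in range(n)]
        for i in range(n):  # update activations (what latent categories are in an entry)
            for t in range(n_topics):
                h_sum = sum(H[t]) + eps
                W[i][t] *= sum(ratio[i][j] * H[t][j] for j in range(p)) / h_sum
    return W, H


entries = ["police", "officer", "pol off", "polis", "policeman", "policier"]
V, vocabulary = count_matrix(entries)
W0, H0 = factorize(V, n_iter=0)   # random initialization only
W, H = factorize(V, n_iter=200)
loss_before = kl_divergence(V, product(W0, H0))
loss_after = kl_divergence(V, product(W, H))
print(f"KL divergence: {loss_before:.2f} -> {loss_after:.2f}")
for entry, activations in zip(entries, W):
    print(f"{entry:10s}", [round(a, 2) for a in activations])
```

Each row of W is the low-dimensional encoding of one entry; each row of H says which 3-grams make up a latent category.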
2 String models of latent categories [Cerda and Varoquaux 2020]
Encodings that extract latent categories
[Figure: encoding of the categories (rows) along the latent categories (columns, labelled with feature names that are truncated in this slide). Rows: Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant.]
Inferring plausible feature names
[Figure: encoding of the same twelve job titles (rows, from Legislative Analyst II to Police Sergeant) along latent categories (columns) labelled with inferred feature names, partially legible here: “assistant, library”; “equipment, operator”; “…tration, specialist”; “…ts worker, warehouse”; “…g, program, manager”; “mechanic, community”; “…er, rescue”; “…correction, officer”.]
2 Un-blackboxify: Data science with dirty categories
Retrieving insight from machine-learning on non-curated data
Feature importances
Given a fitted model:
from sklearn.inspection import permutation_importance
r = permutation_importance(model, X_val, y_val,
                           n_repeats=30,
                           random_state=0)
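What permutation importance measures can be shown with a hand-written version. This is a plain-Python sketch of the idea, not scikit-learn's implementation: the toy model, the data and the name `permutation_importance_sketch` are assumptions made for the illustration.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination of the predictions."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance_sketch(predict, X, y, n_repeats=30, seed=0):
    """Mean drop in R2 when one column of X is shuffled, for each column."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the outcome
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = r2_score(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy validation data: the outcome depends on feature 0 only
rng = random.Random(1)
X_val = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y_val = [3.0 * x0 + rng.gauss(0, 0.1) for x0, _ in X_val]


def model_predict(row):
    """Stand-in for a fitted model; it ignores feature 1."""
    return 3.0 * row[0]


importances = permutation_importance_sketch(model_predict, X_val, y_val)
print([round(v, 3) for v in importances])
```

Shuffling the feature the model relies on degrades the score sharply, while shuffling the ignored feature changes nothing. The importances are computed on held-out data, as in the `X_val`, `y_val` call above.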
[Cerda and Varoquaux 2020]
What characteristics of an employee are important to explain salary?
[Figure: bar chart of permutation importances (x-axis from 0.0 to 0.2) for the inferred feature names: “Information, Technology, Technologist”; “Officer, Office, Police”; “Liquor, Clerk, Store”; “School, Health, Room”; “Environmental, Telephone, Capital”; “Lieutenant, Captain, Chief”; “Income, Assistance, Compliance”; “Manager, Management, Property”.]
2 Dirty categories in practice
Software
DirtyCat: Dirty category software
http://dirty-cat.github.io
from dirty_cat import GapEncoder
gap_encoder = GapEncoder()
transformed_values = gap_encoder.fit_transform(df)
Practical tip
Gradient-boosted trees work very well on tabular data
sklearn.ensemble.HistGradientBoostingRegressor
[Cerda... 2018, Cerda and Varoquaux 2020]
3 Learning with missing values
[Josse... 2019]
Gender   Date Hired   Employee Position Title
M        09/12/1988   Master Police Officer
F        NA           Social Worker IV
M        07/16/2007   Police Officer III
F        02/05/2007   Police Aide
M        01/13/2014   Electrician I
M        04/28/2002   Bus Operator
M        NA           Bus Operator
F        06/26/2006   Social Worker III
F        01/26/2000   Library Assistant I
M        NA           Library Assistant I
3 Classic statistics missing-values framework [Josse... 2019]
Model a) a complete data-generating process
Model b) a random process occluding entries
Missing at random situation (MAR)
Probability of missingness does not depend on unobserved values.
Theorem [Rubin 1976]: under MAR, the maximum-likelihood fit of model a) can
be obtained from the observed data alone, ignoring the unobserved values.
Justification for imputation of missing values
Missing Not at Random situation (MNAR)
Missingness not ignorable
Hard: need model of missing-values mechanism
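The difference between the two situations can be illustrated with a small simulation. Everything here is an assumption made for the example: two Gaussian variables, a true slope of 0.5, and masks that hide 80% of the entries on one side of zero. It only shows one consequence of the framework, in one setup.

```python
import random

rng = random.Random(0)
n = 20000
x1 = [rng.gauss(0, 1) for _ in range(n)]      # always observed
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]  # may be missing; true slope on x1 is 0.5


def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var


def observed_slope(mask):
    """Slope of x2 on x1 using only the rows where x2 is observed."""
    kept = [(a, b) for a, b, missing in zip(x1, x2, mask) if not missing]
    return slope([a for a, _ in kept], [b for _, b in kept])


# MAR: whether x2 is missing depends only on the observed x1
mar_mask = [a > 0 and rng.random() < 0.8 for a in x1]
# MNAR: whether x2 is missing depends on the unobserved value of x2 itself
mnar_mask = [b > 0 and rng.random() < 0.8 for b in x2]

slope_mar = observed_slope(mar_mask)
slope_mnar = observed_slope(mnar_mask)
print(f"slope with MAR missingness:  {slope_mar:.2f}")
print(f"slope with MNAR missingness: {slope_mnar:.2f}")
```

In this setup the relation between x2 and x1 is recovered from the observed rows under MAR, but it is distorted under MNAR, because the mask itself carries information about the hidden values.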
[Figure: three panels titled “Complete”, “MAR” and “MNAR”.]
3 Supervised learning with missing values
Difficulties
Half-discrete input space (NA ∪ R)
Complex predictor even in simple settings (linear + MAR)
[Le Morvan... 2020b]
Y = β₁* X₁ + β₂* X₂ + β₀*
cor(X₁, X₂) = 0.5
If X₂ is missing, the coefficient of X₁ should compensate for
the missingness of X₂.
Up to 2^d sets of slopes (one per missingness pattern)
[Figure: effect of X₂ lost versus effect of X₂ accounted for by X₁.]
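The compensation can be checked numerically. In this sketch the coefficients (0, 1, 2), the unit variances and the absence of noise are assumptions chosen for the example; only the correlation of 0.5 comes from the slide. When X₂ is unavailable, the best linear predictor on X₁ alone has slope β₁* + 0.5 β₂* rather than β₁*.

```python
import math
import random

rng = random.Random(0)
n = 50000
beta0, beta1, beta2 = 0.0, 1.0, 2.0  # hypothetical coefficients

x1 = [rng.gauss(0, 1) for _ in range(n)]
# x2 has unit variance and correlation 0.5 with x1
x2 = [0.5 * a + math.sqrt(0.75) * rng.gauss(0, 1) for a in x1]
y = [beta0 + beta1 * a + beta2 * b for a, b in zip(x1, x2)]


def least_squares_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var


# Predictor to use on rows where X2 is missing: regress Y on X1 alone
slope_without_x2 = least_squares_slope(x1, y)
expected = beta1 + 0.5 * beta2

print(f"slope on X1 when X2 is observed: {beta1:.2f}")
print(f"slope on X1 when X2 is missing:  {slope_without_x2:.2f} (expected {expected:.2f})")
```

Each missingness pattern calls for its own set of slopes, which is where the 2^d count comes from with d features.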
3 NeuMiss network: adapted neural architecture [Le Morvan... 2020a]
Derive theoretical forms of optimal predictors
Approximate them with functions learnable by neural networks
Tailored architecture that learns all slopes jointly
[Figure: R2 score minus Bayes rate (y-axis, from −0.10 to 0.00) against number of parameters (x-axis, 10^3 to 10^4), on train and test sets, for MLP Deep, MLP Wide and NeuMiss. Network depth: 1, 3, 5, 7, 9; width: 1 d, 3 d, 10 d, 30 d, 50 d.]
NeuMiss needs fewer samples to approximate well and predict well
Also suitable for MNAR settings
AI, electronic health records, health
Electronic health records open new doors for cheaper studies
AI provides statistical estimators and information extraction
Dirty categories
Non-normalized data
Latent categories via string forms
Dirty category software:
http://dirty-cat.github.io
Supervised learning with missing data
Also suitable for MNAR
Broader picture: supervised learning without cleaning
http://project.inria.fr/dirtydata
Acknowledgements
Dirty categories
Patricio Cerda and Balazs Kegl
Missing data
Julie Josse, Erwan Scornet, Marine Le Morvan, Nicolas Prost
Electronic Health records
AP-HP, Alexandre Gramfort, Marc Lavielle, Lihu Chen,
Fabian Suchanek, Thomas Moreau, Antoine Neuraz...
4 References
P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, 2018.
B. Colnet, I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang. Causal inference methods for combining randomized trials and observational studies: a review. arXiv preprint arXiv:2011.08047, 2020.
N. Dagan, N. Barda, E. Kepten, O. Miron, S. Perchik, M. A. Katz, M. A. Hernán, M. Lipsitch, B. Reis, and R. D. Balicer. BNT162b2 mRNA Covid-19 vaccine in a nationwide mass vaccination setting. New England Journal of Medicine, 2021.
M. J. Funk, D. Westreich, C. Wiesen, T. Stürmer, M. A. Brookhart, and M. Davidian. Doubly robust estimation of causal effects. American Journal of Epidemiology, 173(7):761–767, 2011.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020a.
M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020b.
D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 

AI, electronic records, and health

  • 1. AI, electronic records, health, and why machine learning on dirty data opens new doors. Gaël Varoquaux
  • 2. 1. From clinical studies to electronic health records
  • 3. Clinical studies are hard. Covid-19 vaccines: Moderna, 2 months of lab development, then 8 months of clinical studies for approval, with 30 000 volunteers; Pfizer–BioNTech, 3 months of lab development, then 7 months of clinical studies for approval, with 10 000 volunteers. Experimentation on humans means slow cycles and high risks. Conclusions must be drawn across individual heterogeneity.
  • 4. Real-life evidence versus clinical trials. Do vaccines prevent spread? A question ill-suited to intervention and requiring huge samples. Is the AstraZeneca vaccine applicable to people above 65 years old? Fragile people (the elderly) were excluded from the clinical trial: an external validity problem [Colnet... 2020]. Evidence from real-world observational data: following individuals as they get, or do not get, the treatment [Dagan... 2021].
  • 5. Electronic Health Records, a source of real-life data. Patient records (anything available, really): claims databases, accounting, measurement history, doctors’ notes. Great longitudinal coverage. AP-HP (Paris hospitals): 39 hospitals, 8 million patients a year. Great population coverage. Free data.
  • 7. Electronic Health Records: dirty data challenges. Missing values: uneven data on patients and across hospital sites; data not measured because it was not applicable, or for lack of time in an emergency; a much larger rate of missingness than in clinical studies (often 80%). Non-normalized information: manual input with different conventions, e.g. “Diabetes Type 2”, “Diabetes Mellitus, Type 2”, “DM2”.
  • 9. Electronic Health Records: observational data ≠ experiments. Treated and non-treated patients are not comparable, so a direct comparison gives naive conclusions on treatment efficacy. Causal inference techniques, settings: treatment $T \in \{0, 1\}$; outcome $Y$, with potential outcome $Y(T)$ (treated or not); covariates $X$ (condition of the patient). Need unconfoundedness, $\{Y(1), Y(0)\} \perp\!\!\!\perp T \mid X$: given the covariates, the potential outcomes $Y$ of patients do not depend on whether they have really been treated or not. Accounting for covariates compensates for the differences between the two groups.
  • 12. AI for causal inference [Funk... 2011]. Unconfoundedness: $\{Y(1), Y(0)\} \perp\!\!\!\perp T \mid X$. Outcome regression: model $Y(T) = f(X, T)$; $f$ can be learned with an “AI” (statistical machine learning). Reweighting: match the covariate distributions of treated and non-treated patients; learn $P(T \mid X)$ with an “AI”. More generally: machine learning as a non-parametric statistical estimator.
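The reweighting idea on this slide can be illustrated on simulated data. The sketch below is only an illustration under stated assumptions, not the estimator of [Funk... 2011]: the data-generating process, the single binary covariate, the effect size and all variable names are invented for the example, and P(T|X) is estimated by simple frequencies where the slide would use a learned model.

```python
import random

random.seed(0)
n = 20000
true_effect = 1.0  # assumed treatment effect, chosen for the simulation

# Simulated patients: the covariate x drives both treatment and outcome,
# so treated and non-treated groups are not comparable.
rows = []
for _ in range(n):
    x = 1 if random.random() < 0.5 else 0
    t = 1 if random.random() < (0.8 if x else 0.2) else 0
    y = true_effect * t + 2.0 * x + random.gauss(0.0, 1.0)
    rows.append((x, t, y))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


# Naive estimate: difference of mean outcomes, ignoring the covariate.
naive = (mean(y for x, t, y in rows if t == 1)
         - mean(y for x, t, y in rows if t == 0))

# Propensity P(T=1 | X=x), estimated here by the treated fraction per stratum.
propensity = {x0: mean(t for x, t, y in rows if x == x0) for x0 in (0, 1)}

# Inverse-propensity weighting: reweight each group so that its covariate
# distribution matches that of the whole population.
ipw = mean(t * y / propensity[x] - (1 - t) * y / (1 - propensity[x])
           for x, t, y in rows)

print(f"naive estimate: {naive:.2f}")
print(f"reweighted estimate: {ipw:.2f} (simulated effect: {true_effect})")
```

In this simulation the naive difference is far from the simulated effect because treated patients mostly have x = 1, while the reweighted estimate lands close to it. This only holds because the single covariate captures all confounding, which is the unconfoundedness assumption stated on the slide.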
  • 13. Remaining agenda: machine learning can model this “dirty data”. 1. From clinical studies to electronic health records; 2. Learning on non-normalized data; 3. Learning with missing values.
  • 14. 2. Learning on non-normalized data [Cerda... 2018, Cerda and Varoquaux 2020]. Example column “Employee Position Title”: Master Police Officer; Social Worker IV; Police Officer III; Police Aide; Electrician I; Bus Operator; Bus Operator; Social Worker III; Library Assistant I; Library Assistant I. Data expressed with categories in non-standardized form: “dirty categories”.
  • 15. Dirty categories break standard statistical practice. OneHotEncoder is not suitable for such a column. Overlapping categories: “Master Police Officer”, “Police Officer III”, “Police Officer II”... High cardinality: 400 unique entries in 10 000 rows. Rare categories: only 1 “Architect III”. New categories in the test set.
  • 16. Standard approach: data curation. Database normalization and feature engineering split the “Employee Position Title” column into “Position” and “Rank”: Master Police Officer ⇒ (Police Officer, Master); Social Worker III ⇒ (Social Worker, III); Police Officer II ⇒ (Police Officer, II); Social Worker II ⇒ (Social Worker, II); Police Officer III ⇒ (Police Officer, III).
  • 17. Standard approach: data curation. Merging entities: deduplication and record linkage, to output a “clean” database. Example column “Company name”: Pfizer Inc.; Pfizer Pharmaceuticals LLC; Pfizer International LLC; Pfizer Limited; Pfizer Corporation Hong Kong Limited; Pfizer Pharmaceuticals Korea Limited; ... Difficult without supervision.
  • 18. Standard approach: data curation. Hard to make automatic and turn-key. Our view: supervised learning on dirty categories. The statistical question should inform curation: is Pfizer Corporation Hong Kong the same entity as Pfizer Pharmaceuticals Korea?
  • 20. Simple fix: adding similarities to one-hot encoding. One-hot encoding, with columns (London, Londres, Paris): Londres → (0, 1, 0); London → (1, 0, 0); Paris → (0, 0, 1). Similarity encoding [Cerda... 2018], same columns: Londres → (0.3, 1.0, 0.0); London → (1.0, 0.3, 0.0); Paris → (0.0, 0.0, 1.0). The encoded matrix is $X \in \mathbb{R}^{n \times p}$: rows for the (possibly new) categories, columns linking them to known categories, entries given by a string distance such as string_distance(Londres, London). Similarity encoding = prototype methods. How to choose a small number of prototypes? The right prototypes may not be in the training set: given “big cat”, “fat cat”, “big dog”, “fat dog”, estimate prototypes.
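The encoding on this slide can be sketched in a few lines. This is a minimal illustration, assuming one particular string similarity (Jaccard overlap of character 3-grams on lowercased strings); the function names are hypothetical and the similarity used in [Cerda... 2018] may differ in its details.

```python
def ngrams(text, n=3):
    """Set of character n-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        # Both strings are shorter than n: fall back to exact matching.
        return 1.0 if a.lower() == b.lower() else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(value, categories):
    """One column per known category, filled with similarities, not 0/1."""
    return [ngram_similarity(value, category) for category in categories]


categories = ["London", "Londres", "Paris"]
# "Londen" is not among the known categories, as for a new test-set entry.
for value in categories + ["Londen"]:
    print(value, [round(s, 2) for s in similarity_encode(value, categories)])
```

With this choice, “London” and “Londres” share 2 of their 7 distinct 3-grams, a similarity of about 0.29, in line with the 0.3 shown on the slide. The unseen entry “Londen” still gets a non-zero encoding, where a one-hot encoder would have no column for it.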
  • 21. Substring information. Drug Name: alcohol; ethyl alcohol; isopropyl alcohol; polyvinyl alcohol; isopropyl alcohol swab; 62% ethyl alcohol; alcohol 68%; alcohol denat; benzyl alcohol; dehydrated alcohol. Employee Position Title: Police Aide; Master Police Officer; Mechanic Technician II; Police Officer III; Senior Architect; Senior Engineer Technician; Social Worker III.
  • 22. GaP Encoder, a latent category model [Cerda and Varoquaux 2020]. Topic model on substrings (GaP: Gamma-Poisson factorization). Model strings as a linear combination of substrings. [Figure: the 3-gram count matrix of the entries “police officer”, “pol off”, “polis”, “policeman”, “policier” (3-grams such as “pol”, “lic”, “ice”, “off”, “cer”), factorized into one matrix giving what substrings are in a latent category and one matrix giving what latent categories are in an entry.]
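The input of such a topic model is a matrix of substring counts. The sketch below only builds that 3-gram count matrix for the five example strings of the slide; it does not perform the Gamma-Poisson factorization itself, and the helper names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """List of character n-grams of a string, with repetitions."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


entries = ["police officer", "pol off", "polis", "policeman", "policier"]

# One Counter of 3-grams per entry.
counts = [Counter(char_ngrams(entry)) for entry in entries]

# Shared vocabulary of 3-grams: one column per substring.
vocabulary = sorted(set().union(*counts))

# Entries-by-substrings count matrix, the "documents x words" matrix
# that a topic model would then factorize.
count_matrix = [[c[gram] for gram in vocabulary] for c in counts]

print(len(entries), "entries,", len(vocabulary), "distinct 3-grams")
print("count of 'pol' per entry:",
      [row[vocabulary.index("pol")] for row in count_matrix])
```

Every entry contains the 3-gram “pol” exactly once, which is the kind of shared substring a latent category can pick up even though no two strings are identical.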
  • 24. String models of latent categories [Cerda and Varoquaux 2020]. Encodings that extract latent categories, and inferring plausible feature names. [Figure: categories against inferred feature names. Categories: Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant. Inferred feature names include “assistant, library”, “equipment, operator”, “mechanic, community”, “correction, officer”.]
  • 25. Un-blackboxify: data science with dirty categories [Cerda and Varoquaux 2020]. Retrieving insight from machine learning on non-curated data. Feature importances, given a fitted model:
        from sklearn.inspection import permutation_importance
        r = permutation_importance(model, X_val, y_val,
                                   n_repeats=30, random_state=0)
  • 26. Un-blackboxify: data science with dirty categories [Cerda and Varoquaux 2020]. What characteristics of an employee are important to explain salary? [Figure: permutation importances (axis from 0.0 to 0.2) of the inferred feature names: “Information, Technology, Technologist”; “Officer, Office, Police”; “Liquor, Clerk, Store”; “School, Health, Room”; “Environmental, Telephone, Capital”; “Lieutenant, Captain, Chief”; “Income, Assistance, Compliance”; “Manager, Management, Property”.]
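To make the principle behind permutation importance concrete, here is a hand-rolled sketch in plain Python. It is not scikit-learn's implementation: the function name, the toy model and the toy data are assumptions for illustration, and importance is measured here as the increase in mean squared error rather than the drop in a score.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in error when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link with the target.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            error = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy validation data: the target depends on the first feature only.
data_rng = random.Random(1)
X_val = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(200)]
y_val = [3.0 * row[0] for row in X_val]


def model(row):
    """Stand-in for a fitted model: uses feature 0, ignores feature 1."""
    return 3.0 * row[0]


importances = permutation_importance_sketch(model, X_val, y_val)
print("importance of feature 0:", round(importances[0], 2))
print("importance of feature 1:", round(importances[1], 2))
```

Shuffling the feature that the model ignores leaves its predictions unchanged, so its importance is zero, while shuffling the feature it relies on degrades the error. Because the procedure only needs predictions, it applies to any fitted model, including one trained on encoded dirty categories.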
  • 27. Dirty categories in practice [Cerda... 2018, Cerda and Varoquaux 2020]. Software: DirtyCat, dirty category software, http://dirty-cat.github.io
        from dirty_cat import GapEncoder
        gap_encoder = GapEncoder()
        transformed_values = gap_encoder.fit_transform(df)
    Practical tip: gradient-boosted trees work very well on tabular data, see sklearn.ensemble.HistGradientBoostingRegressor.
  • 28. 3. Learning with missing values [Josse... 2019]
        Gender   Date Hired   Employee Position Title
        M        09/12/1988   Master Police Officer
        F        NA           Social Worker IV
        M        07/16/2007   Police Officer III
        F        02/05/2007   Police Aide
        M        01/13/2014   Electrician I
        M        04/28/2002   Bus Operator
        M        NA           Bus Operator
        F        06/26/2006   Social Worker III
        F        01/26/2000   Library Assistant I
        M        NA           Library Assistant I
  • 30. Classic statistics missing-values framework [Josse... 2019]. Model a): a complete data-generating process. Model b): a random process occluding entries. Missing at random situation (MAR): the probability of missingness does not depend on unobserved values. Theorem [Rubin 1976]: in MAR, the maximum likelihood of model a) can be obtained on the observed data while ignoring the unobserved values. This is the justification for imputation of missing values. Missing Not at Random situation (MNAR): missingness is not ignorable. Hard: need a model of the missing-values mechanism.
  • 31. Classic statistics missing-values framework [Josse... 2019]. [Figure: three panels titled Complete, MAR, and MNAR.]
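The difference between ignorable and non-ignorable missingness can be seen on a one-dimensional simulation. This is a minimal sketch with assumed mechanisms, not taken from the slides: values are masked completely at random (a special case of MAR, where missingness depends on nothing), or masked when they exceed a threshold (a self-masking MNAR mechanism). The 30% rate and the 0.5 threshold are arbitrary choices.

```python
import random

rng = random.Random(0)

# Complete data: draws from a standard normal, whose true mean is 0.
values = [rng.gauss(0.0, 1.0) for _ in range(20000)]

# Ignorable mechanism: each entry is masked with probability 0.3,
# independently of its value.
observed_mcar = [v for v in values if rng.random() >= 0.3]

# MNAR mechanism (self-masking): an entry is masked whenever its own,
# then unobserved, value exceeds 0.5.
observed_mnar = [v for v in values if v <= 0.5]


def mean(xs):
    return sum(xs) / len(xs)


print("mean of complete data:      ", round(mean(values), 3))
print("mean of observed data, MCAR:", round(mean(observed_mcar), 3))
print("mean of observed data, MNAR:", round(mean(observed_mnar), 3))
```

Under the ignorable mechanism the observed values still estimate the mean of the complete data, whereas under self-masking the observed mean is pulled clearly below zero. Correcting that bias requires a model of the missing-values mechanism, which is what makes MNAR hard.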
  • 32. Supervised learning with missing values. Difficulties: half-discrete input space ($\mathrm{NA} \cup \mathbb{R}$); complex predictor even in simple settings (linear + MAR) [Le Morvan... 2020b]. Example: $Y = \beta^\star_1 X_1 + \beta^\star_2 X_2 + \beta^\star_0$ with $\mathrm{cor}(X_1, X_2) = 0.5$. If $X_2$ is missing, the coefficient of $X_1$ should compensate for the missingness of $X_2$, which gives up to $2^d$ sets of slopes. [Figure: a fit where the effect of $X_2$ is lost, against a fit where the effect of $X_2$ is accounted for by $X_1$.]
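The compensation effect in this example can be checked numerically. The sketch below is an illustration under assumptions that are not in the slides: standardized Gaussian features, coefficient values chosen arbitrarily, no noise, and missingness of $X_2$ independent of the data, so that the best linear predictor without $X_2$ is simply the regression of $Y$ on $X_1$ alone. Since $E[X_2 \mid X_1] = \rho X_1$ for such features, the expected slope is $\beta^\star_1 + \rho \beta^\star_2$.

```python
import random

rng = random.Random(0)

# Assumed coefficients for the illustration; rho is the correlation of X1, X2.
beta_1, beta_2, beta_0, rho = 1.0, 2.0, 0.5, 0.5
n = 20000

x1, y = [], []
for _ in range(n):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    a = z1
    b = rho * z1 + (1 - rho ** 2) ** 0.5 * z2  # cor(a, b) = rho
    x1.append(a)
    y.append(beta_1 * a + beta_2 * b + beta_0)


def mean(xs):
    return sum(xs) / len(xs)


def ols_slope(x, target):
    """Least-squares slope of target regressed on a single feature x."""
    mx, mt = mean(x), mean(target)
    cov = sum((a - mx) * (b - mt) for a, b in zip(x, target))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var


# Slope to use on X1 when X2 is not available.
slope_without_x2 = ols_slope(x1, y)

print("slope on X1 when X2 is observed:", beta_1)
print("slope on X1 when X2 is missing: ", round(slope_without_x2, 2))
print("beta_1 + rho * beta_2:          ", beta_1 + rho * beta_2)
```

The slope on $X_1$ therefore changes with the missingness pattern: one set of coefficients per pattern, hence up to $2^d$ sets in dimension $d$.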
  • 35. NeuMiss network: adapted neural architecture [Le Morvan... 2020a]. Derive theoretical forms of optimal predictors; approximate them with functions learnable by neural networks; this gives a tailored architecture which learns all slopes jointly. [Figure: R2 score minus Bayes rate (from 0.00 down to −0.10) against number of parameters (10^3 to 10^4), on train and test sets, for MLP Deep, MLP Wide and NeuMiss; legend: network depth 1, 3, 5, 7, 9; width 1 d, 3 d, 10 d, 30 d, 50 d.] NeuMiss needs fewer samples to approximate well and predict well. Also suitable for MNAR settings.
  • 38. AI, electronic health records, health. Electronic health records open new doors for cheaper studies. AI provides statistical estimators and information extraction. Dirty categories: non-normalized data; latent categories via string forms; dirty category software at http://dirty-cat.github.io. Supervised learning with missing data: also suitable for MNAR. Broader picture: supervised learning without cleaning, http://project.inria.fr/dirtydata
  • 39. Acknowledgements. Dirty categories: Patricio Cerda and Balázs Kégl. Missing data: Julie Josse, Erwan Scornet, Marine Le Morvan, Nicolas Prost. Electronic health records: AP-HP, Alexandre Gramfort, Marc Lavielle, Lihu Chen, Fabian Suchanek, Thomas Moreau, Antoine Neuraz...
  • 40. References I
        P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
        P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, 2018.
        B. Colnet, I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse, and S. Yang. Causal inference methods for combining randomized trials and observational studies: a review. arXiv preprint arXiv:2011.08047, 2020.
        N. Dagan, N. Barda, E. Kepten, O. Miron, S. Perchik, M. A. Katz, M. A. Hernán, M. Lipsitch, B. Reis, and R. D. Balicer. BNT162b2 mRNA Covid-19 vaccine in a nationwide mass vaccination setting. New England Journal of Medicine, 2021.
        M. J. Funk, D. Westreich, C. Wiesen, T. Stürmer, M. A. Brookhart, and M. Davidian. Doubly robust estimation of causal effects. American Journal of Epidemiology, 173(7):761–767, 2011.
        J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
  • 41. References II
        M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss networks: differentiable programming for supervised learning with missing values. In Advances in Neural Information Processing Systems 33, 2020a.
        M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear predictor on linearly-generated data with missing values: non consistency and solutions. AISTATS, 2020b.
        D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.