Digital medicine is often associated with high-end biomedical sensors. Health research is typically performed on carefully curated data from closely followed cohorts. And yet, the most complete source of data on population health is the mundane bookkeeping of hospitals, which records prescriptions and observations. When the Covid-19 pandemic struck Paris, such data enabled us to understand this emerging disease in real time. This data is "dirty": full of missing values, incorrectly typed entries, and other biases. For these reasons, it is frowned upon as a source for medical evidence and practice, despite being available at no cost. AI, or rather machine learning, can provide powerful statistical models to extract and analyse this wealth of information.
Drawing from some lessons learned analysing electronic health records during the covid crisis, I will discuss how progress in machine learning on dirty data can enable new data practices. In particular, I will show how machine learning models can gracefully deal with missing values; and how they can model categories in the data with typos and morphological variants. The corresponding statistical models will build upon classical statistical practice, to provide well-grounded machine-learning tools, including new neural network architectures.
1. AI, electronic records, health, and why machine learning on dirty data opens new doors
Gaël Varoquaux
2. 1 From clinical studies to electronic
health records
G Varoquaux 2
3. 1 Clinical studies are hard
Covid 19 vaccines
Moderna:
- 2 months lab development
- 8 months clinical studies for approval
30 000 volunteers
Pfizer–BioNTech:
- 3 months lab development
- 7 months clinical studies for approval
10 000 volunteers
Experimentation on humans (slow cycles, high risks)
Conclusion across individual heterogeneity
4. 1 Real-life evidence versus clinical trials
Do vaccines prevent spread?
Question ill-suited to intervention & requiring huge samples
Is the AstraZeneca vaccine applicable to people above 65 years old?
Fragile people (elderly) were excluded from the clinical trial
An external validity problem [Colnet... 2020]
Evidence from real-world observational data: following
individuals as they get, or not, the treatment [Dagan... 2021].
5. 1 Electronic Health Records – source of real-life data
Patient records (anything available, really)
Claims databases, accounting, measurement history, doctors’ notes
Great longitudinal coverage
AP-HP (Paris hospitals)
39 hospitals
8 million patients a year
Great population coverage
Free data
7. 1 Electronic Health Records: dirty data challenges
Missing values
Uneven data on patients, across hospital sites
Data not measured because not applicable, no time in face of urgency...
Much larger rate of missingness than in clinical studies (often 80%)
Non-normalized information
Manual input, different conventions
“Diabetes Type 2” — “Diabetes Mellitus, Type 2” — “DM2”
9. 1 Electronic Health Records: observational data ≠ experiments
Treated & non-treated patients are not comparable
Naive conclusions on treatment efficacy
Causal inference techniques
Settings
- Treatment T (∈ {0, 1})
- Outcome Y
  Potential outcome Y(T) (treated or not)
- Covariates X (condition of patient)
Need unconfoundedness
$\{Y(1), Y(0)\} \perp\!\!\!\perp T \mid X$
Given the covariates X, the potential outcomes Y of patients do not depend on whether they have really been treated or not
Accounting for covariates to compensate for differences
12. 1 AI for causal inference [Funk... 2011]
Unconfoundedness: $\{Y(1), Y(0)\} \perp\!\!\!\perp T \mid X$
Outcome regression
Model Y(T) = f(X, T)
f can be learned with an “AI” (statistical machine learning)
Reweighting
Match the covariate distribution of treated and non-treated
[Figure: covariate distributions of the two groups; legend: Treated, Non-treated]
Learn P(T|X) with an “AI”
More generally:
machine learning as non-parametric statistical estimator
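The two estimators on this slide can be sketched on simulated data. This is a minimal illustration, not the analysis behind the talk: the data-generating process, the function names and the use of simple per-group frequencies in place of a learned model for f(X, T) and P(T|X) are all assumptions made for the example. The covariate X is binary, so "learning" reduces to averaging within cells.

```python
import random


def simulate(n, seed=0):
    """Simulated (x, t, y) rows: patients with x=1 are treated more often."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0            # covariate X
        p_treat = 0.8 if x == 1 else 0.2              # treatment depends on X
        t = 1 if rng.random() < p_treat else 0        # treatment T
        y = 1.0 * t + 2.0 * x + rng.gauss(0.0, 1.0)   # true effect of T is 1.0
        rows.append((x, t, y))
    return rows


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def naive_effect(rows):
    """Difference of mean outcomes, ignoring the covariate (confounded)."""
    treated = mean(y for _, t, y in rows if t == 1)
    control = mean(y for _, t, y in rows if t == 0)
    return treated - control


def outcome_regression_effect(rows):
    """Estimate f(x, t) by cell means, then average f(x, 1) - f(x, 0)."""
    f = {(x, t): mean(y for xi, ti, y in rows if xi == x and ti == t)
         for x in (0, 1) for t in (0, 1)}
    return mean(f[(x, 1)] - f[(x, 0)] for x, _, _ in rows)


def reweighting_effect(rows):
    """Estimate P(T=1|X) by frequencies, then compare weighted means."""
    e = {x: mean(t for xi, t, _ in rows if xi == x) for x in (0, 1)}
    num1 = sum(y / e[x] for x, t, y in rows if t == 1)
    den1 = sum(1 / e[x] for x, t, y in rows if t == 1)
    num0 = sum(y / (1 - e[x]) for x, t, y in rows if t == 0)
    den0 = sum(1 / (1 - e[x]) for x, t, y in rows if t == 0)
    return num1 / den1 - num0 / den0


rows = simulate(20_000)
naive = naive_effect(rows)
regression = outcome_regression_effect(rows)
reweighting = reweighting_effect(rows)
print(f"naive={naive:.2f} regression={regression:.2f} reweighting={reweighting:.2f}")
```

In this toy setting the naive comparison overstates the effect because treated patients more often have x=1, while both adjusted estimates land close to the simulated effect of 1.0. With a single binary covariate and frequency-based estimates the two adjusted estimators coincide; with richer covariates, f and P(T|X) would be learned models and the two would differ.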
13. 1 Remaining agenda: Machine learning can model this “dirty data”
1 From clinical studies to electronic health records
2 Learning on non-normalized data
3 Learning with missing values
14. 2 Learning on non-normalized data
[Cerda... 2018, Cerda and Varoquaux 2020]
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
Data expressed with categories
in non-standardized form
“Dirty categories”
15. 2 Dirty categories break standard statistical practice
Employee Position Title
Master Police Officer
Social Worker IV
Police Officer III
Police Aide
Electrician I
Bus Operator
Bus Operator
Social Worker III
Library Assistant I
Library Assistant I
OneHotEncoder not suitable
Overlapping categories
“Master Police Officer”,
“Police Officer III”,
“Police Officer II”...
High cardinality
400 unique entries
in 10 000 rows
Rare categories
Only 1 “Architect III”
New categories in test set
16. 2 Standard approach: data curation
Database normalization
Feature engineering
Employee Position Title   ⇒   Position         Rank
Master Police Officer         Police Officer   Master
Social Worker III             Social Worker    III
Police Officer II             Police Officer   II
Social Worker II              Social Worker    II
Police Officer III            Police Officer   III
17. 2 Standard approach: data curation
Database normalization
Feature engineering
Employee Position Title   ⇒   Position         Rank
Master Police Officer         Police Officer   Master
Social Worker III             Social Worker    III
...                           ...
Merging entities
Deduplication & record linkage
Output a “clean” database
Company name
Pfizer Inc.
Pfizer Pharmaceuticals LLC
Pfizer International LLC
Pfizer Limited
Pfizer Corporation Hong Kong Limited
Pfizer Pharmaceuticals Korea Limited
...
Difficult without supervision
18. 2 Standard approach: data curation
Database normalization
Feature engineering
Employee Position Title   ⇒   Position         Rank
Master Police Officer         Police Officer   Master
Social Worker III             Social Worker    III
...                           ...
Merging entities
Deduplication & record linkage
Output a “clean” database
Company name
Pfizer Inc.
Pfizer Pharmaceuticals LLC
...
Hard to make automatic and turn-key
Our view: supervised learning on dirty categories
The statistical question should inform curation
Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea
20. 2 Simple fix: Adding similarities to one-hot encoding
One-hot encoding
          London  Londres  Paris
Londres   0       1        0
London    1       0        0
Paris     0       0        1
Similarity encoding [Cerda... 2018]
          London  Londres  Paris
Londres   0.3     1.0      0.0
London    1.0     0.3      0.0
Paris     0.0     0.0      1.0
$X \in \mathbb{R}^{n \times p}$
Handles new categories, links categories
Off-diagonal entries: string distance(Londres, London)
= Prototype methods
How to choose a small number of prototypes?
The right prototypes may not be in training set
“big cat” “fat cat”
“big dog” “fat dog”
Estimate prototypes
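A minimal sketch of similarity encoding, to make the table above concrete. The slide only says "string distance"; the choice of a Jaccard similarity over character 3-grams, and the function names, are assumptions made for this illustration rather than the exact definition used in [Cerda... 2018].

```python
def ngrams(string, n=3):
    """Set of character n-grams of a lower-cased string."""
    s = string.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def similarity(a, b, n=3):
    """Jaccard similarity between the n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        # Strings too short to have any n-gram: fall back to exact match.
        return 1.0 if a.lower() == b.lower() else 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def similarity_encode(values, prototypes):
    """One row per value, one column per prototype category."""
    return [[similarity(v, p) for p in prototypes] for v in values]


prototypes = ["London", "Londres", "Paris"]
encoded = similarity_encode(["Londres", "London", "Paris", "Londn"], prototypes)
for value, row in zip(["Londres", "London", "Paris", "Londn"], encoded):
    print(value, [round(s, 2) for s in row])
```

Unlike one-hot encoding, the misspelled entry "Londn", which is not among the prototypes, still gets a non-zero representation that is closest to "London".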
21. 2 Substring information
Drug Name
alcohol
ethyl alcohol
isopropyl alcohol
polyvinyl alcohol
isopropyl alcohol swab
62% ethyl alcohol
alcohol 68%
alcohol denat
benzyl alcohol
dehydrated alcohol
Employee Position Title
Police Aide
Master Police Officer
Mechanic Technician II
Police Officer III
Senior Architect
Senior Engineer Technician
Social Worker III
22. 2 GaP Encoder, a latent category model [Cerda and Varoquaux 2020]
Topic model on sub-strings
(GaP: Gamma-Poisson factorization)
[Figure: a string split into successive 3-grams (3-gram 1, 3-gram 2, 3-gram 3, ...)]
Model strings as a linear combination of substrings
[Figure: a matrix of 3-gram counts, with one row per entry (police, officer, pol off, polis, policeman, policier) and one column per 3-gram (pol, lic, ice, ce_, _of, off, fic, cer, er_), is factorized into two matrices, documents × topics and topics × 3-grams]
What substrings are in a latent category
What latent categories are in an entry
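The factorization idea can be sketched in a few lines. This is a stand-in, not the GaP encoder itself: it applies plain non-negative matrix factorization with multiplicative updates for the Kullback-Leibler divergence (which corresponds to a Poisson likelihood) to a 3-gram count matrix, and it omits the Gamma prior of [Cerda and Varoquaux 2020]. The entries, the number of topics and all function names are hypothetical.

```python
import math
import random

EPS = 1e-12  # guards against division by zero and log(0)


def char_ngrams(string, n=3):
    s = string.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def count_matrix(entries, n=3):
    """Rows: entries; columns: 3-grams; values: occurrence counts."""
    vocab = sorted({g for e in entries for g in char_ngrams(e, n)})
    index = {g: j for j, g in enumerate(vocab)}
    counts = [[0.0] * len(vocab) for _ in entries]
    for i, entry in enumerate(entries):
        for g in char_ngrams(entry, n):
            counts[i][index[g]] += 1.0
    return counts, vocab


def product(W, H):
    k, m = len(H), len(H[0])
    return [[sum(row[a] * H[a][j] for a in range(k)) for j in range(m)]
            for row in W]


def kl_divergence(V, WH):
    total = 0.0
    for v_row, p_row in zip(V, WH):
        for v, p in zip(v_row, p_row):
            total += p - v
            if v > 0:
                total += v * math.log(v / (p + EPS))
    return total


def factorize(V, k=2, n_iter=50, seed=0):
    """Approximate V by W (entries x topics) times H (topics x 3-grams)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.uniform(0.5, 1.5) for _ in range(k)] for _ in range(n)]
    H = [[rng.uniform(0.5, 1.5) for _ in range(m)] for _ in range(k)]
    losses = [kl_divergence(V, product(W, H))]
    for _ in range(n_iter):
        WH = product(W, H)
        for a in range(k):  # update topics: which substrings are in a topic
            col_sum = sum(W[i][a] for i in range(n)) + EPS
            for j in range(m):
                num = sum(W[i][a] * V[i][j] / (WH[i][j] + EPS) for i in range(n))
                H[a][j] *= num / col_sum
        WH = product(W, H)
        for i in range(n):  # update activations: which topics are in an entry
            for a in range(k):
                row_sum = sum(H[a]) + EPS
                num = sum(H[a][j] * V[i][j] / (WH[i][j] + EPS) for j in range(m))
                W[i][a] *= num / row_sum
        losses.append(kl_divergence(V, product(W, H)))
    return W, H, losses


entries = ["police officer", "police aide", "policeman",
           "bus operator", "bus driver", "bus mechanic"]
V, vocab = count_matrix(entries)
W, H, losses = factorize(V, k=2)
for entry, activations in zip(entries, W):
    print(f"{entry:15s}", [round(a, 2) for a in activations])
```

Each row of W is the encoding of one entry: a small non-negative vector saying how much of each latent category the string contains. Each row of H lists how strongly each 3-gram belongs to a latent category.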
23. 2 String models of latent categories [Cerda and Varoquaux 2020]
Encodings that extract latent categories
[Figure: encoding of each category over the latent categories. Rows (categories): Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant. Columns (feature names, partly legible): library; operator; specialist; warehouse; manager; community; rescue; officer.]
24. 2 String models of latent categories [Cerda and Varoquaux 2020]
Inferring plausible feature names
[Figure: encoding of each category over the latent categories, labelled with inferred feature names. Rows (categories): Legislative Analyst II, Legislative Attorney, Equipment Operator I, Transit Coordinator, Bus Operator, Senior Architect, Senior Engineer Technician, Financial Programs Manager, Capital Projects Manager, Mechanic Technician II, Master Police Officer, Police Sergeant. Columns (inferred feature names, partly legible): "assistant, library"; "equipment, operator"; "..., specialist"; "... worker, warehouse"; "..., program, manager"; "mechanic, community"; "..., rescue"; "correction, officer".]
25. 2 Un-blackboxify: Data science with dirty categories
Retrieving insight from machine-learning on non-curated data
Feature importances
Given a fitted model:
from sklearn.inspection import permutation_importance
r = permutation_importance(model, X_val, y_val,
                           n_repeats=30,
                           random_state=0)
[Cerda and Varoquaux 2020]
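To show what a permutation importance measures, here is a minimal sketch written from scratch. It is not scikit-learn's implementation: it reports the mean increase in squared error when one column is shuffled, whereas `permutation_importance` reports the decrease in the estimator's score. The toy model, the data and the function names are hypothetical.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, X, y, n_repeats=30, seed=0):
    """Mean increase in error when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between column j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            error = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Stand-in for a fitted model: uses feature 0, ignores feature 1."""
    return 3.0 * row[0]


rng = random.Random(1)
X_val = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
y_val = [3.0 * x0 + rng.gauss(0.0, 0.1) for x0, _ in X_val]
importances = permutation_importance_sketch(toy_model, X_val, y_val)
print([round(v, 2) for v in importances])
```

Shuffling the feature that the model relies on degrades predictions sharply, while shuffling the ignored feature changes nothing. The same reading applies to the encoded latent categories: a large importance means the model's predictions depend on that category.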
26. 2 Un-blackboxify: Data science with dirty categories
What characteristics of an employee are important to explain salary?
[Figure: bar chart of permutation importances (x-axis from 0.0 to 0.2) for the inferred feature names:]
Information, Technology, Technologist
Officer, Office, Police
Liquor, Clerk, Store
School, Health, Room
Environmental, Telephone, Capital
Lieutenant, Captain, Chief
Income, Assistance, Compliance
Manager, Management, Property
[Cerda and Varoquaux 2020]
27. 2 Dirty categories in practice
Software
DirtyCat: Dirty category software
http://dirty-cat.github.io
from dirty_cat import GapEncoder
gap_encoder = GapEncoder()
transformed_values = gap_encoder.fit_transform(df)
Practical tip
Gradient-boosted trees work very well on tabular data
sklearn.ensemble.HistGradientBoostingRegressor
[Cerda... 2018, Cerda and Varoquaux 2020]
28. 3 Learning with missing values
[Josse... 2019]
Gender Date Hired Employee Position Title
M 09/12/1988 Master Police Officer
F NA Social Worker IV
M 07/16/2007 Police Officer III
F 02/05/2007 Police Aide
M 01/13/2014 Electrician I
M 04/28/2002 Bus Operator
M NA Bus Operator
F 06/26/2006 Social Worker III
F 01/26/2000 Library Assistant I
M NA Library Assistant I
30. 3 Classic statistics missing-values framework [Josse... 2019]
Model a) a complete data-generating process
Model b) a random process occluding entries
Missing at random situation (MAR)
Probability of missingness does not depend on unobserved values.
Theorem [Rubin 1976], in MAR, maximum likelihood of model a) can
be obtained on observed data while ignoring the unobserved values.
Justification for imputation of missing values
Missing Not at Random situation (MNAR)
Missingness not ignorable
Hard: need model of missing-values mechanism
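The difference between the two situations can be illustrated with a small simulation. Everything here is an assumption made for the example: two correlated Gaussian variables, a deterministic masking rule, and imputation of X2 by a linear regression on X1 fitted on the observed rows.

```python
import random


def simulate(n, seed=0):
    """Pairs (x1, x2), standard Gaussian with correlation 0.5; mean of x2 is 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        x2 = 0.5 * x1 + (0.75 ** 0.5) * rng.gauss(0.0, 1.0)
        data.append((x1, x2))
    return data


def fit_line(pairs):
    """Ordinary least squares of the second coordinate on the first."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_after_imputation(data, is_missing):
    """Impute missing x2 from x1 using only observed rows, then average."""
    observed = [(x1, x2) for x1, x2 in data if not is_missing(x1, x2)]
    slope, intercept = fit_line(observed)
    completed = [slope * x1 + intercept if is_missing(x1, x2) else x2
                 for x1, x2 in data]
    return sum(completed) / len(completed)


data = simulate(20_000)
# MAR: x2 is hidden depending on x1 only, which is always observed.
mar_mean = mean_after_imputation(data, lambda x1, x2: x1 > 0)
# MNAR: x2 is hidden depending on its own, unobserved, value.
mnar_mean = mean_after_imputation(data, lambda x1, x2: x2 > 0)
print(f"MAR: {mar_mean:.2f}  MNAR: {mnar_mean:.2f}  (true mean: 0.00)")
```

Under the MAR rule the relation between X2 and X1 is the same in observed and hidden rows, so imputation recovers the mean. Under the MNAR rule only low values of X2 are ever seen, and no model fitted on the observed data alone can correct for that.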
31. 3 Classic statistics missing-values framework [Josse... 2019]
[Figure: three panels showing the same data under each setting: Complete, MAR, MNAR]
32. 3 Supervised learning with missing values
Difficulties
Half-discrete input space (NA ∪ R)
Complex predictor even in simple settings (linear + MAR)
[Le Morvan... 2020b]
$Y = \beta^\star_1 X_1 + \beta^\star_2 X_2 + \beta^\star_0$
$\mathrm{cor}(X_1, X_2) = 0.5$
If $X_2$ is missing, the coefficient of $X_1$ should compensate for the missingness of $X_2$.
Up to $2^d$ sets of slopes
[Figure labels: effect of $X_2$ lost; effect of $X_2$ accounted for by $X_1$]
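A quick numerical check of the compensation effect, as a sketch under stated assumptions: standard Gaussian covariates with correlation 0.5, entries of X2 removed completely at random, and arbitrary illustrative coefficients. On the rows where X2 is missing, the best linear predictor from X1 alone should have slope β1 + 0.5 β2 rather than β1.

```python
import random


def fit_line(pairs):
    """Ordinary least squares of the second coordinate on the first."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Illustrative coefficients (not from the talk).
beta0, beta1, beta2 = 0.5, 1.0, 2.0

rng = random.Random(0)
rows = []
for _ in range(20_000):
    x1 = rng.gauss(0.0, 1.0)
    x2 = 0.5 * x1 + (0.75 ** 0.5) * rng.gauss(0.0, 1.0)  # cor(x1, x2) = 0.5
    y = beta0 + beta1 * x1 + beta2 * x2 + rng.gauss(0.0, 0.1)
    x2_missing = rng.random() < 0.5  # X2 removed completely at random
    rows.append((x1, y, x2_missing))

# Regress y on x1 using only the rows where x2 is unavailable.
slope_missing, intercept_missing = fit_line(
    [(x1, y) for x1, y, missing in rows if missing])
expected_slope = beta1 + 0.5 * beta2
print(f"slope when X2 is missing: {slope_missing:.2f} "
      f"(expected about {expected_slope:.2f}, not {beta1:.2f})")
```

With d covariates, each of the up to $2^d$ missingness patterns calls for its own set of adjusted slopes, which is what makes the predictor complex even in this linear setting.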
35. 3 Neumiss network: adapted neural architecture [Le Morvan... 2020a]
Derive theoretical forms of optimal predictors
Approximate them with functions learnable by neural networks
Tailored architecture which learns all slopes jointly
[Figure: R2 score minus Bayes rate (y-axis, from −0.10 to 0.00) against number of parameters (x-axis, 10^3 to 10^4). Legend: MLP Deep, MLP Wide, NeuMiss; train set and test set; network depth 1, 3, 5, 7, 9; width 1 d, 3 d, 10 d, 30 d, 50 d]
NeuMiss needs fewer samples to approximate well and predict well
Also suitable for MNAR settings
36. AI, electronic health records, health
Electronic health records open new doors for cheaper studies
AI provides statistical estimators and information extraction
Dirty categories
Non-normalized data
Latent categories via string forms
Dirty category software:
http://dirty-cat.github.io
Supervised learning with missing data
Also suitable for MNAR
Broader picture: supervised learning without cleaning
http://project.inria.fr/dirtydata
39. Acknowledgements
Dirty categories
Patricio Cerda and Balazs Kegl
Missing data
Julie Josse, Erwan Scornet, Marine Le Morvan, Nicolas Prost
Electronic Health records
AP-HP, Alexandre Gramfort, Marc Lavielle, Lihu Chen,
Fabian Suchanek, Thomas Moreau, Antoine Neuraz...
40. 4 References I
P. Cerda and G. Varoquaux. Encoding high-cardinality string categorical
variables. IEEE Transactions on Knowledge and Data Engineering, 2020.
P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty
categorical variables. Machine learning, 2018.
B. Colnet, I. Mayer, G. Chen, A. Dieng, R. Li, G. Varoquaux, J.-P. Vert, J. Josse,
and S. Yang. Causal inference methods for combining randomized trials and
observational studies: a review. arXiv preprint arXiv:2011.08047, 2020.
N. Dagan, N. Barda, E. Kepten, O. Miron, S. Perchik, M. A. Katz, M. A. Hernán,
M. Lipsitch, B. Reis, and R. D. Balicer. BNT162b2 mRNA Covid-19 vaccine in a
nationwide mass vaccination setting. New England Journal of Medicine, 2021.
M. J. Funk, D. Westreich, C. Wiesen, T. Stürmer, M. A. Brookhart, and
M. Davidian. Doubly robust estimation of causal effects. American journal of
epidemiology, 173(7):761–767, 2011.
J. Josse, N. Prost, E. Scornet, and G. Varoquaux. On the consistency of
supervised learning with missing values. arXiv preprint arXiv:1902.06931, 2019.
41. 4 References II
M. Le Morvan, J. Josse, T. Moreau, E. Scornet, and G. Varoquaux. NeuMiss
networks: differentiable programming for supervised learning with missing values.
In Advances in Neural Information Processing Systems 33, 2020a.
M. Le Morvan, N. Prost, J. Josse, E. Scornet, and G. Varoquaux. Linear
predictor on linearly-generated data with missing values: non consistency and
solutions. AISTATS, 2020b.
D. B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.