Sample size for binary logistic prediction models: Beyond events per variable criteria
1. Sample size for binary logistic prediction models:
Beyond events per variable criteria
Maarten van Smeden, PhD
Leiden University Medical Center
Senior researcher
MEMTAB 2018
Utrecht, July 3
2. Slides available at: https://www.slideshare.net/MaartenvanSmeden/presentations MEMTAB, Utrecht, July 3 2018
Sample size prediction modeling literature (2018)
3.
Events per variable (EPV)
4.
Events per variable (EPV)
5.
Events per variable (EPV)
Critique
• Flimsy supporting evidence for 10 EPV rule [1]
• 50 EPV rule more realistic with traditional variable selection techniques [2]
• 5 EPV sufficient to reduce (average) overfitting after “modern” shrinkage [3]
• EPV only part of sample size story [4]
[1] van Smeden et al., BMC MRM, 2016, doi: 10.1186/s12874-016-0267-3
[2] Steyerberg et al., Stat Med, 2000, doi: 10.1002/(SICI)1097-0258(20000430)19:8<1059::AID-SIM412>3.0.CO;2-0
[3] Pavlou et al., Stat Med, 2016, doi: 10.1002/sim.6782
[4] Ogundimu et al., JCE, 2016, doi: 10.1016/j.jclinepi.2016.02.031
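As a rough illustration of why EPV alone says little about total sample size, the sketch below converts an EPV target into a total N. It assumes the usual definition, EPV = number of events / number of candidate predictors, which the slides do not spell out. The function name and the example values (8 candidate predictors, events fractions of 1/4 and 1/16) are hypothetical.

```python
import math

def total_sample_size(epv, n_candidate_predictors, events_fraction):
    """Total N implied by an EPV target.

    Assumes EPV = events / candidate predictors and that the events
    fraction is the expected proportion of events in the sample.
    """
    required_events = epv * n_candidate_predictors
    return math.ceil(required_events / events_fraction)

if __name__ == "__main__":
    # Same EPV and same number of predictors, different events fractions.
    print(total_sample_size(10, 8, 1 / 4))
    print(total_sample_size(10, 8, 1 / 16))
```

Under these assumptions the same 10 EPV target corresponds to 320 individuals at an events fraction of 1/4 and 1,280 at 1/16, so two studies with identical EPV can differ fourfold in N.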
6.
EPV forgets about the intercept?
7.
New sample size criterion: rMSPE
Root Mean Squared Prediction Error (rMSPE):
standard deviation of out-of-sample probability prediction error
Rationale: since clinical prediction is about probability estimation, a
sample size criterion should be based on allowable error rates in these
estimates
10.
*Coverage property not guaranteed: it assumes errors are IID normal
12.
Unfortunately no closed form solution for out-of-sample rMSPE
13.
Simulation study
• 4,032 simulation conditions (factorial design)
simulation factors: EPV (3 to 50), number of candidate predictors (4 to 12),
events fraction (1/16 to 1/2), area under the ROC curve (0.65 to 0.85), distribution
and correlation of predictors, number of noise variables
• 5,000 replications per condition -> more than 20 million simulation runs
(4,032 × 5,000 = 20,160,000)
14.
Simulation study
• Each run: generate pairs of derivation data and validation data
(large, with 5,000 expected events) and develop + validate various
logistic prediction models
• Will focus on maximum likelihood logistic regression
15.
Simulation study
• Simulation meta models: fit linear (Ridge) regression models to predict
simulation outcome (rMSPE) from simulation factors
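A single cell of such a simulation could be sketched as follows. This is not the authors' code: it is a minimal standard-library illustration that assumes three independent standard-normal predictors, hypothetical true coefficients, a validation set of 2,000 individuals (rather than 5,000 expected events), and that the data-generating probabilities play the role of the "true" probabilities because the model is correctly specified. All function names are made up for the example.

```python
import math
import random

def sigmoid(z):
    # Clamp the linear predictor so exp() cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_logistic(rows, ys, iters=10):
    """Maximum likelihood logistic regression by Newton-Raphson.

    Each row must start with 1.0 for the intercept.
    """
    k = len(rows[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        # Tiny diagonal jitter keeps the Hessian invertible (numerical guard only).
        hess = [[1e-8 if i == j else 0.0 for j in range(k)] for i in range(k)]
        for x, y in zip(rows, ys):
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            w = p * (1.0 - p)
            for i in range(k):
                grad[i] += (y - p) * x[i]
                for j in range(k):
                    hess[i][j] += w * x[i] * x[j]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def draw(n, beta_true, rng):
    """Draw n individuals: standard-normal predictors, Bernoulli outcomes."""
    rows, ys, probs = [], [], []
    for _ in range(n):
        x = [1.0] + [rng.gauss(0.0, 1.0) for _ in beta_true[1:]]
        p = sigmoid(sum(b * v for b, v in zip(beta_true, x)))
        rows.append(x)
        probs.append(p)
        ys.append(1 if rng.random() < p else 0)
    return rows, ys, probs

def one_run(n_derivation, n_validation, beta_true, rng):
    """Derive a model, then return its out-of-sample rMSPE."""
    rows, ys, _ = draw(n_derivation, beta_true, rng)
    beta_hat = fit_logistic(rows, ys)
    val_rows, _, val_probs = draw(n_validation, beta_true, rng)
    sq = [(p - sigmoid(sum(b * v for b, v in zip(beta_hat, x)))) ** 2
          for x, p in zip(val_rows, val_probs)]
    return math.sqrt(sum(sq) / len(sq))

def mean_rmspe(n_derivation, reps, seed=1):
    """Average rMSPE over `reps` independent derivation/validation pairs."""
    rng = random.Random(seed)
    beta_true = [0.0, 0.5, 0.5, 0.5]  # hypothetical; events fraction about 1/2
    runs = [one_run(n_derivation, 2000, beta_true, rng) for _ in range(reps)]
    return sum(runs) / reps

if __name__ == "__main__":
    for n in (50, 200, 800):
        print(n, round(mean_rmspe(n, reps=5), 3))
```

Averaged over replications, the rMSPE should shrink as the derivation sample grows. A realistic study would vary the other simulation factors as well and use far more replications than this sketch does.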
16.
Simulation meta models
rMSPE
• Meta-model with 3 (of 7) factors: N, events fraction and number of
(candidate) predictors: R² = 0.992
• (Meta-model with only EPV as factor: R² = 0.432)
https://mvansmeden.shinyapps.io/BeyondEPV/
18.
In press
Thanks to Richard Riley for commenting on an early draft
19.
Final remarks
• Prediction models developed with 10 EPV can produce wildly inaccurate
probability estimates
• New sample size criterion - based on rMSPE - could be accurately
approximated by predictable data characteristics
• Validation, analytical work, and extensions still need to be done
• Our new sample size calculation shiny-app is “Beta”; can be used to
approximate rMSPE for settings that stay close to our simulation
design (article in press)
• One sample size criterion probably isn’t always enough. Notably, low events
fraction settings may come with low rMSPE and a high need for shrinkage
20.
Final remarks
Binary logistic regression sample size recommendations
1. Think about allowable probability prediction error (e.g. in terms of 95%
coverage regions)
2. If you can, run a realistic simulation study
3. If you can’t do 2, use our shiny-app with caution to calculate minimal
sample size
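Step 1 can be turned into a target rMSPE with a short calculation. The sketch below assumes prediction errors are IID normal with mean zero, so that the rMSPE acts as their standard deviation; as noted on the coverage slide, this property is not guaranteed. The function name and the example tolerance of 0.10 are hypothetical.

```python
from statistics import NormalDist

def max_rmspe_for_coverage(allowable_error, coverage=0.95):
    """Largest rMSPE for which `coverage` of prediction errors fall
    within +/- allowable_error, assuming errors ~ Normal(0, rMSPE^2)."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return allowable_error / z

if __name__ == "__main__":
    # Example: 95% of predicted probabilities within 0.10 of the "true" value.
    print(round(max_rmspe_for_coverage(0.10), 3))
```

Under these assumptions, tolerating errors of up to 0.10 in 95% of predictions corresponds to a target rMSPE of roughly 0.05, which can then be fed into a simulation study or the shiny-app.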
21.
https://mvansmeden.shinyapps.io/BeyondEPV/
24.
Logistic prediction models
Schmidt et al., Schizophr Bull, 2017, doi: 10.1093/schbul/sbw098
Damen et al., BMJ, 2016, doi: 10.1136/bmj.i2416
Collins et al., BMC MRM, 2014, doi: 10.1186/1471-2288-14-40
Collins et al., BMC Med, 2011, doi: 10.1186/1741-7015-9-103
Bouwmeester et al., PLoS Med, 2012, doi: 10.1371/journal.pmed.1001221
25.
New sample size criterion
Use expected root Mean Squared Prediction Error (rMSPE)
Interpretation: standard deviation of expected out-of-sample probability
prediction error
rMSPE = \sqrt{E[(\pi_i - \hat{\pi}_i)^2]},
where \pi_i are the unobservable “true” probabilities that would have been
obtained had the prediction model been derived with the correct
functional form and an infinite sample size, and \hat{\pi}_i are the estimated
probabilities from the derived model in a large external set of similar
individuals (“out-of-sample”).
26.
Prediction error: the difference between the probability estimated by a
prediction model when applied in a large-sample validation study and the
“true” probability that would be obtained had the same model been derived
from an infinitely large sample
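Given paired “true” and estimated probabilities, the rMSPE defined on the previous slide is a one-line computation. The sketch below is only an illustration: the function name and the three example probabilities are made up, and in practice the “true” probabilities are known only in simulations.

```python
import math

def rmspe(true_probs, estimated_probs):
    """Root mean squared prediction error between paired probabilities."""
    if not true_probs or len(true_probs) != len(estimated_probs):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - q) ** 2 for p, q in zip(true_probs, estimated_probs)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

if __name__ == "__main__":
    # Hypothetical values: squared errors are 0.01, 0.00 and 0.04.
    print(round(rmspe([0.1, 0.2, 0.3], [0.2, 0.2, 0.1]), 3))
```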