Sample size for binary logistic prediction models: Beyond events per variable criteria

Maarten van Smeden, PhD
Leiden University Medical Center

Senior researcher

MEMTAB 2018

Utrecht, July 3

Slides available at: https://www.slideshare.net/MaartenvanSmeden/presentations
Sample size prediction modeling literature (2018)

Events per variable (EPV)

Events per variable (EPV)
Critique

• Flimsy supporting evidence for 10 EPV rule [1]

• 50 EPV rule more realistic with traditional variable selection techniques [2]

• 5 EPV sufficient to reduce (average) overfitting after “modern” shrinkage [3]

• EPV only part of sample size story [4]

[1] van Smeden et al., BMC MRM, 2016, doi: 10.1186/s12874-016-0267-3

[2] Steyerberg et al., Stat Med, 2000, doi: 10.1002/(SICI)1097-0258(20000430)19:8<1059::AID-SIM412>3.0.CO;2-0

[3] Pavlou et al., Stat Med, 2016, doi: 10.1002/sim.6782

[4] Ogundimu et al., JCE, 2016, doi: 10.1016/j.jclinepi.2016.02.031

EPV forgets about the intercept?

New sample size criteria: rMSPE
Root Mean Squared Prediction Error (rMSPE):  
standard deviation of out-of-sample probability prediction error

Rationale: since clinical prediction is about probability estimation, a
sample size criterion should be based on allowable error rates in these
estimates

*Coverage property not guaranteed: it assumes prediction errors are IID normal

Unfortunately, there is no closed-form solution for out-of-sample rMSPE
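
The coverage footnote above can be made concrete with a small sketch. It assumes prediction errors are IID normal with mean zero (the assumption the footnote flags as not guaranteed); the function names and the example values 0.05 and 0.10 are hypothetical.

```python
from statistics import NormalDist


def approx_coverage_half_width(rmspe, coverage=0.95):
    """Half-width of the region expected to contain `coverage` of the prediction errors.

    Assumption: errors are IID normal with mean zero and standard deviation rMSPE.
    """
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return z * rmspe


def max_rmspe_for_allowable_error(allowable_error, coverage=0.95):
    """Largest rMSPE compatible with an allowable absolute error, same assumption."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return allowable_error / z


half_width = approx_coverage_half_width(0.05)      # errors mostly within about +/- 0.098
target = max_rmspe_for_allowable_error(0.10)       # rMSPE of about 0.051 or lower
print(round(half_width, 3), round(target, 3))
```

Read this way, an allowable error of 10 percentage points in 95% of predictions translates into a target rMSPE of roughly 0.05, provided the normality assumption is reasonable.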

Simulation study
• 4,032 simulation conditions (factorial design)
simulation factors: EPV (3 to 50), number of candidate predictors (4 to 12),
events fraction (1/16 to 1/2), area under the ROC curve (0.65 to 0.85), distribution
and correlation of predictors, number of noise variables

• 5,000 replications per condition, i.e. more than 20 million simulation runs

Simulation study

• Each run: generate pairs of derivation data and validation data
(large, with 5,000 expected events) and develop + validate various
logistic prediction models

• Will focus on maximum likelihood logistic regression

• Simulation meta models: fit linear (Ridge) regression models to predict
simulation outcome (rMSPE) from simulation factors
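
A single simulation run of the kind described above (derivation data, validation data, maximum likelihood fit) can be sketched as follows. This is a deliberately reduced toy version, not the design used in the study: it assumes one standard normal predictor, hypothetical true coefficients `a` and `b`, a small validation sample, and a hand-written Newton-Raphson fit so that no external library is needed.

```python
import math
import random


def inv_logit(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate(n, a, b, rng):
    """Draw n individuals: one standard normal predictor and a binary outcome."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < inv_logit(a + b * x) else 0 for x in xs]
    return xs, ys


def fit_ml(xs, ys, iterations=15):
    """Maximum likelihood logistic regression (intercept + slope) by Newton-Raphson."""
    a_hat, b_hat = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = inv_logit(a_hat + b_hat * x)
            w = p * (1.0 - p)
            g0 += y - p              # score for the intercept
            g1 += (y - p) * x        # score for the slope
            h00 += w                 # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-10:              # (near-)separation or degenerate data: stop updating
            break
        a_hat += (h11 * g0 - h01 * g1) / det
        b_hat += (h00 * g1 - h01 * g0) / det
    return a_hat, b_hat


def one_run(n_derivation, n_validation, a, b, rng):
    """Derive the model, then compare estimated with true probabilities out of sample."""
    xs, ys = simulate(n_derivation, a, b, rng)
    a_hat, b_hat = fit_ml(xs, ys)
    total = 0.0
    for _ in range(n_validation):
        x = rng.gauss(0.0, 1.0)
        total += (inv_logit(a + b * x) - inv_logit(a_hat + b_hat * x)) ** 2
    return total / n_validation      # mean squared prediction error in this run


def estimate_rmspe(n_derivation, replications=100, n_validation=1000,
                   a=-1.0, b=1.0, seed=2018):
    """Root of the mean squared prediction error, averaged over replications."""
    rng = random.Random(seed)
    mspe = [one_run(n_derivation, n_validation, a, b, rng) for _ in range(replications)]
    return math.sqrt(sum(mspe) / replications)


small, large = estimate_rmspe(50), estimate_rmspe(500)
print(small, large)
```

In this toy setting the estimated rMSPE shrinks as the derivation sample grows; a realistic study would vary the other simulation factors as well and use a far larger validation set.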

Simulation meta models
rMSPE

• Meta-model with 3 (of 7) factors: N, events fraction and number of
(candidate) predictors: R² = 0.992
• (Meta-model with only EPV as factor: R² = 0.432)
https://mvansmeden.shinyapps.io/BeyondEPV/
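
The meta-model idea above can be illustrated with a small ridge regression sketch. Everything numeric here is an assumption made for illustration: the log-log form, the penalty value, and above all the toy "simulation results", whose generating coefficients are invented and are not the published estimates.

```python
import math
import random


def solve(a_matrix, b_vector):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b_vector)
    m = [row[:] + [b] for row, b in zip(a_matrix, b_vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ridge_fit(rows, targets, penalty=1e-3):
    """Ridge regression with an unpenalised intercept; returns [intercept, slopes...]."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, targets)) for i in range(k)]
    for i in range(1, k):            # skip index 0 so the intercept is not shrunk
        xtx[i][i] += penalty
    return solve(xtx, xty)


def r_squared(rows, targets, coefs):
    fitted = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in rows]
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - f) ** 2 for t, f in zip(targets, fitted))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot


# Toy "simulation results": the generating coefficients below are invented.
rng = random.Random(1)
rows, targets = [], []
for _ in range(300):
    n = rng.choice([100, 200, 500, 1000, 2000])
    events_fraction = rng.choice([1 / 16, 1 / 8, 1 / 4, 1 / 2])
    n_predictors = rng.choice([4, 8, 12])
    rows.append([math.log(n), math.log(events_fraction), math.log(n_predictors)])
    targets.append(-0.5 - 0.5 * math.log(n) + 0.2 * math.log(events_fraction)
                   + 0.5 * math.log(n_predictors) + rng.gauss(0.0, 0.05))

coefs = ridge_fit(rows, targets)
fit_quality = r_squared(rows, targets, coefs)
print(coefs, fit_quality)
```

With real simulation output, `targets` would hold the (log) rMSPE observed in each simulation condition, and the fitted coefficients would then give an approximate rMSPE for a planned N, events fraction and number of candidate predictors.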

In press
Thanks to Richard Riley for commenting on an early draft

Final remarks
• Prediction models developed with 10 EPV can produce wildly inaccurate
probability estimates

• New sample size criterion - based on rMSPE - could be accurately
approximated by predictable data characteristics

• Validation, analytical work, and extensions still need to be done

• Our new sample size calculation shiny-app is “Beta”; can be used to
approximate rMSPE for settings that stay close to our simulation
design (article in press)

• One sample size criterion probably isn’t always enough. Notably, low events
fraction settings may come with low rMSPE and a high need for shrinkage

Final remarks
Binary logistic regression sample size recommendations

1. Think about allowable probability prediction error (e.g. in terms of 95%
coverage regions)

2. If you can, run a realistic simulation study

3. If you can’t do 2, use our shiny-app with caution to calculate minimal
sample size

https://mvansmeden.shinyapps.io/BeyondEPV/

Logistic prediction models
Schmidt et al., Schizophr Bull, 2017, doi: 10.1093/schbul/sbw098; Damen et al., BMJ, 2017, doi: 10.1136/bmj.i2416; Collins et al., BMC MRM, 2014, doi: 10.1186/1471-2288-14-40; Collins et al., BMC Med, 2011, doi: 10.1186/1741-7015-9-103; Bouwmeester et al., PLoS Med, 2012, doi: 10.1371/journal.pmed.1001221.

New sample size criterion
Use expected root Mean Squared Prediction Error (rMSPE)

Interpretation: standard deviation of expected out-of-sample probability
prediction error

$$\mathrm{rMSPE} = \sqrt{E\left[(\pi_i - \hat{\pi}_i)^2\right]},$$

where $\pi_i$ are the unobservable “true” probabilities that would have been
obtained had the prediction model been derived with the correct
functional form and an infinite sample size, and $\hat{\pi}_i$ are the estimated
probabilities from the derived model in a large external set of similar
individuals (“out-of-sample”).

Difference between the estimated probability from a prediction model
when applied in a large-sample validation study and the “true” probability
that would be obtained had the same model been derived from an
infinitely large sample
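
As a minimal sketch of the definition above, the rMSPE can be computed directly whenever the "true" probabilities are known, which in practice means in a simulation. The function name and the four example probabilities are hypothetical.

```python
import math


def rmspe(true_probabilities, estimated_probabilities):
    """Root mean squared difference between true and estimated probabilities."""
    if len(true_probabilities) != len(estimated_probabilities):
        raise ValueError("inputs must have the same length")
    squared = [(p - q) ** 2 for p, q in zip(true_probabilities, estimated_probabilities)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical true and estimated probabilities for four individuals.
true_p = [0.10, 0.20, 0.40, 0.80]
est_p = [0.15, 0.10, 0.45, 0.70]
value = rmspe(true_p, est_p)
print(round(value, 4))
```

Because the true probabilities are unobservable in real data, this quantity cannot be computed in an ordinary validation study; it is a property to be approximated at the design stage.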
