Sample size for binary logistic prediction models: Beyond events per variable criteria

Presentation for conference MEMTAB 2018

Published in: Science

  1. Sample size for binary logistic prediction models: Beyond events per variable criteria. Maarten van Smeden, PhD, Senior researcher, Leiden University Medical Center. MEMTAB 2018, Utrecht, July 3
  2. Sample size prediction modeling literature (2018). Slides available at: https://www.slideshare.net/MaartenvanSmeden/presentations (MEMTAB, Utrecht, July 3 2018)
  3. Events per variable (EPV)
  4. Events per variable (EPV)
  5. Events per variable (EPV): critique
     • Flimsy supporting evidence for the 10 EPV rule [1]
     • A 50 EPV rule is more realistic with traditional variable selection techniques [2]
     • 5 EPV is sufficient to reduce (average) overfitting after “modern” shrinkage [3]
     • EPV is only part of the sample size story [4]
     [1] van Smeden et al., BMC MRM, 2016, doi: 10.1186/s12874-016-0267-3
     [2] Steyerberg et al., Stat Med, 2000, doi: 10.1002/(SICI)1097-0258(20000430)19:8<1059::AID-SIM412>3.0.CO;2-0
     [3] Pavlou et al., Stat Med, 2016, doi: 10.1002/sim.6782
     [4] Ogundimu et al., JCE, 2016, doi: 10.1016/j.jclinepi.2016.02.031
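For orientation, EPV is conventionally computed as the number of events divided by the number of candidate predictors; the slides do not spell this out, so that definition is an assumption here. The sketch below shows the arithmetic behind an EPV rule of thumb. The function names and the example values (8 candidate predictors, 25% events) are illustrative and do not come from the talk.

```python
import math


def events_per_variable(n_events: int, n_candidate_predictors: int) -> float:
    """EPV: number of events divided by the number of candidate predictors."""
    if n_candidate_predictors <= 0:
        raise ValueError("need at least one candidate predictor")
    return n_events / n_candidate_predictors


def n_required_by_epv_rule(target_epv: float, n_candidate_predictors: int,
                           events_fraction: float) -> int:
    """Total sample size implied by an EPV rule of thumb.

    Required events = target_epv * predictors; dividing by the expected
    events fraction converts events into a total sample size.
    """
    if not 0.0 < events_fraction <= 1.0:
        raise ValueError("events_fraction must be in (0, 1]")
    required_events = target_epv * n_candidate_predictors
    return math.ceil(required_events / events_fraction)


# Illustrative values only: 8 candidate predictors, 25% events.
for rule in (5, 10, 50):
    print(rule, n_required_by_epv_rule(rule, 8, 0.25))
```

The three rules quoted in the critique differ by a factor of ten in the total sample size they imply for the same hypothetical study.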
  6. EPV forgets about the intercept?
  7. New sample size criterion: rMSPE
     Root Mean Squared Prediction Error (rMSPE): standard deviation of the out-of-sample probability prediction error.
     Rationale: since clinical prediction is about probability estimation, a sample size criterion should be based on the allowable error in these estimates.
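As a minimal sketch of the quantity defined on this slide, the function below computes rMSPE from paired "true" and estimated probabilities. The true probabilities are only known in a simulation setting, and the four example values are made up for illustration; the function name is likewise an assumption.

```python
import math
from typing import Sequence


def rmspe(true_probs: Sequence[float], est_probs: Sequence[float]) -> float:
    """Root mean squared prediction error between true and estimated probabilities."""
    if len(true_probs) != len(est_probs) or not true_probs:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - q) ** 2 for p, q in zip(true_probs, est_probs)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up probabilities for four out-of-sample individuals.
true_probs = [0.10, 0.30, 0.50, 0.80]
est_probs = [0.15, 0.25, 0.55, 0.75]
print(round(rmspe(true_probs, est_probs), 3))
```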
  10. * Coverage property not guaranteed: it assumes errors are IID normal
  12. Unfortunately, there is no closed-form solution for the out-of-sample rMSPE
  15. Simulation study
     • 4,032 simulation conditions (factorial design). Simulation factors: EPV (3 to 50), number of candidate predictors (4 to 12), events fraction (1/16 to 1/2), area under the ROC curve (0.65 to 0.85), distribution and correlation of predictors, number of noise variables
     • 5,000 replications per condition, giving more than 20 million simulation runs
     • Each run: generate a pair of derivation and validation data sets (the validation set large, with 5,000 expected events), then develop and validate various logistic prediction models
     • Focus here: maximum likelihood logistic regression
     • Simulation meta-models: fit linear (ridge) regression models to predict the simulation outcome (rMSPE) from the simulation factors
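A single simulation run of the kind described on this slide might look roughly like the sketch below. It is not the authors' code. It assumes independent standard normal predictors, an intercept of 0 (events fraction about 1/2), hypothetical coefficients of 0.5, and a much smaller validation set and replication count than the talk reports. Because the true probabilities are known in a simulation, rMSPE is computed directly on fresh validation covariates.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_logistic_ml(X, y, max_iter=50, tol=1e-8):
    """Maximum likelihood logistic regression via Newton-Raphson (no shrinkage)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(max_iter):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = sigmoid(dot(beta, xi))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    info[a][c] += w * xi[a] * xi[c]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


def simulate_once(n_deriv, true_beta, n_valid, rng):
    """One run: derive a model, then compute rMSPE on fresh validation covariates."""
    def draw_x():
        # Intercept term plus independent standard normal predictors (an assumption).
        return [1.0] + [rng.gauss(0.0, 1.0) for _ in true_beta[1:]]

    X = [draw_x() for _ in range(n_deriv)]
    y = [1 if rng.random() < sigmoid(dot(true_beta, x)) else 0 for x in X]
    beta_hat = fit_logistic_ml(X, y)

    sq_err = 0.0
    for _ in range(n_valid):
        x = draw_x()
        sq_err += (sigmoid(dot(true_beta, x)) - sigmoid(dot(beta_hat, x))) ** 2
    return math.sqrt(sq_err / n_valid)


def mean_rmspe(n_deriv, true_beta, n_valid=2000, reps=5, seed=1):
    """Average rMSPE over a few replications (the talk used 5,000 per condition)."""
    rng = random.Random(seed)
    total = sum(simulate_once(n_deriv, true_beta, n_valid, rng) for _ in range(reps))
    return total / reps


# Hypothetical condition: 4 predictors, intercept 0 (events fraction about 1/2).
TRUE_BETA = [0.0, 0.5, 0.5, 0.5, 0.5]
for n in (100, 400, 1600):
    print(n, round(mean_rmspe(n, TRUE_BETA), 3))
```

A full factorial study would loop this over every combination of the simulation factors and store one rMSPE summary per condition for the meta-models.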
  16. Simulation meta-models: rMSPE
     • Meta-model with 3 (of 7) factors, namely N, events fraction and number of (candidate) predictors: R² = 0.992
     • (Meta-model with only EPV as factor: R² = 0.432)
     https://mvansmeden.shinyapps.io/BeyondEPV/
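The mechanics of comparing a three-factor meta-model with an EPV-only meta-model can be sketched as follows. Everything numeric here is a placeholder. The "simulation results" come from a made-up power law rather than from the authors' simulations, the log-log model form is an assumption, and the ridge penalty value is arbitrary. The three-factor model fits well by construction, so the sketch only illustrates how the two R² values would be obtained; it does not reproduce the reported 0.992 and 0.432.

```python
import math
from itertools import product


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ridge_fit(X, y, lam=1e-6):
    """Ridge regression coefficients; column 0 is an unpenalised intercept."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    for i in range(1, k):
        xtx[i][i] += lam
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def r_squared(X, y, beta):
    """Proportion of variance in y explained by the fitted linear model."""
    mean_y = sum(y) / len(y)
    ss_res = sum((t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(X, y))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def toy_log_rmspe(n, events_fraction, n_predictors):
    # Placeholder outcome: a made-up power law, NOT the authors' estimates.
    return (-0.5 * math.log(n) + 0.25 * math.log(events_fraction)
            + 0.5 * math.log(n_predictors))


# Hypothetical factorial grid of N, events fraction and number of predictors.
grid = list(product([100, 200, 400, 800, 1600], [1 / 16, 1 / 8, 1 / 4, 1 / 2], [4, 8, 12]))
y = [toy_log_rmspe(n, ef, p) for n, ef, p in grid]

# Meta-model 1: N, events fraction and number of predictors as separate factors.
X_three = [[1.0, math.log(n), math.log(ef), math.log(p)] for n, ef, p in grid]
# Meta-model 2: EPV only, with EPV = N * events fraction / number of predictors.
X_epv = [[1.0, math.log(n * ef / p)] for n, ef, p in grid]

r2_three = r_squared(X_three, y, ridge_fit(X_three, y))
r2_epv = r_squared(X_epv, y, ridge_fit(X_epv, y))
print(round(r2_three, 3), round(r2_epv, 3))
```

The EPV-only model forces N, events fraction and number of predictors to share a single coefficient, which is why it can fit worse whenever the three factors affect rMSPE differently.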
  18. In press. Thanks to Richard Riley for commenting on an early draft
  19. Final remarks
     • Prediction models developed with 10 EPV can produce wildly inaccurate probability estimates
     • The new sample size criterion, based on rMSPE, could be accurately approximated from predictable data characteristics
     • Validation, analytical work, and extensions still need to be done
     • Our new sample size calculation Shiny app is a “beta” version; it can be used to approximate rMSPE for settings that stay close to our simulation design (article in press)
     • One sample size criterion probably isn’t always enough. Notably, low events fraction settings may come with low rMSPE and a high need for shrinkage
  20. Final remarks: binary logistic regression sample size recommendations
     1. Think about the allowable probability prediction error (e.g. in terms of 95% coverage regions)
     2. If you can, run a realistic simulation study
     3. If you can’t do 2, use our Shiny app with caution to calculate the minimal sample size
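Recommendation 1 can be made concrete with a small conversion between an allowable error and a target rMSPE. The sketch assumes prediction errors are IID normal with mean zero; slide 10 notes that the coverage property is not guaranteed when this fails. The function names and the example values (rMSPE of 0.05, allowable error of 0.10) are illustrative.

```python
from statistics import NormalDist


def coverage_half_width(rmspe: float, coverage: float = 0.95) -> float:
    """Half-width of the error region, assuming mean-zero IID normal errors."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return z * rmspe


def target_rmspe(allowable_error: float, coverage: float = 0.95) -> float:
    """Largest rMSPE for which `coverage` of errors stay within +/- allowable_error."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return allowable_error / z


# With rMSPE = 0.05, about 95% of errors fall within roughly +/- 0.098.
print(round(coverage_half_width(0.05), 3))
# Allowing errors of at most 0.10 for 95% of predictions needs rMSPE near 0.051.
print(round(target_rmspe(0.10), 3))
```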
  21. https://mvansmeden.shinyapps.io/BeyondEPV/
  24. Logistic prediction models
     Schmidt et al., Schizophr Bull, 2017, doi: 10.1093/schbul/sbw098; Damen et al., BMJ, 2016, doi: 10.1136/bmj.i2416; Collins et al., BMC MRM, 2014, doi: 10.1186/1471-2288-14-40; Collins et al., BMC Med, 2011, doi: 10.1186/1741-7015-9-103; Bouwmeester et al., PLoS Med, 2012, doi: 10.1371/journal.pmed.1001221.
  25. New sample size criterion: use the expected root Mean Squared Prediction Error (rMSPE)
     Interpretation: standard deviation of the expected out-of-sample probability prediction error
     $\mathrm{rMSPE} = \sqrt{E[(\pi_i - \hat{\pi}_i)^2]}$
     where $\pi_i$ are the unobservable “true” probabilities that would have been obtained had the prediction model been derived with the correct functional form and an infinite sample size, and $\hat{\pi}_i$ are the estimated probabilities from the derived model in a large external set of similar individuals (“out-of-sample”).
  26. Difference between the estimated probability from a prediction model when applied in a large-sample validation study and the “true” probability that would be obtained if the same model had been derived from an infinitely large sample
