2. The vision…
Streamlined sequence of processes:
- Clinical workflow at point of care: capture data entered as part of routine clinical workflow
- EMR: automated E-T-L
- Predictive model: machine learning algorithms for target class prediction
- Decision support: vendor-'neutral' scoring tools (intranet based, JSON serialization)
3. Agenda / Table of contents
1 Readmission after Heart Failure
2 Data Structure of an Electronic Medical Record
3 TreeNet™ Modeling with our Dataset
4 Lessons Learned and Next Step(s)
7. Model Error: The Bias – Variance Decomposition
Prediction Error = Irreducible Error + Bias² + Variance
Hastie T, Tibshirani R, Friedman JH. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer; 2009.
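The decomposition above can be checked numerically: fit many models on fresh training samples and compare the average squared prediction error against the sum of the three components. A minimal sketch, assuming a toy sine-wave target and a deliberately underfit linear model (everything here is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumptions for illustration): true function, noise level,
# and the point x0 at which we decompose the expected prediction error
f = lambda x: np.sin(x)
sigma = 0.3
x0 = 1.0

# Fit many degree-1 polynomials on fresh training samples; the linear
# form deliberately underfits the sine, so bias is nonzero
preds = []
for _ in range(2000):
    x = rng.uniform(0, np.pi, 20)
    y = f(x) + rng.normal(0, sigma, 20)
    coef = np.polyfit(x, y, 1)
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2
variance = preds.var()
irreducible = sigma ** 2

# Monte Carlo estimate of the left-hand side: expected squared error
# of the fitted models against fresh noisy observations at x0
y0 = f(x0) + rng.normal(0, sigma, preds.size)
total = ((y0 - preds) ** 2).mean()
print(total, irreducible + bias2 + variance)  # the two nearly agree
```

The bias term is driven by the model class (a line cannot follow a sine), the variance term by the finite training samples, and the irreducible term by the observation noise, mirroring the slide's formula.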
8. Model caveats
Association does not prove causality.
Models are retrospective (observational) and therefore hypothesis generating (i.e. not hypothesis proving).
10. Congestive Heart Failure
Common cause for admission.
Readmission in excess of 23% (Bueno H, et al. JAMA 2010;303:2141-2147).
Risk factors for readmission extensively studied.
Published reviews cite over 120 studies.
- Methods: Logistic regression; Cox proportional hazard
- C-statistic in 0.6-0.7 range
Reduction of readmission has been declared a national goal.
Improved risk models have the potential to more effectively deploy
targeted disease management.
11. EMR data structure
Data collected for clinical workflow.
Large volume
- Multiple observations; repeated measures
- Many interactions and interdependencies
Complex dataset
- Continuous, Ordinal, Nominal (low and high order), Binary
- ‘High-order variable-dimension nominal variables’
Missing data:
- May represent error or practice patterns
Unbalanced classes
Outliers and Entry errors
12. Preliminary Dataset
- 1612 consecutive heart failure discharges abstracted
- 1280 candidate predictors screened
- Target class: Readmission at 30 days (binary)

Administrative candidate predictors:
- Admission source, status, service
- Age, gender, race
- Primary/secondary payers
- Primary/secondary diagnoses (names and condition categories)
- Total length of stay, ICU length of stay
- Hospital costs and charges
- Discharge status and disposition
- All-cause same-center admission in preceding year

Clinical candidate predictors:
- Specialty medical services consulted
- Specialty ancillary services consulted
- Blood laboratory values
- Medication names / therapeutic classes
- Dosages of medications
- Patient weights during hospitalization
- Transfusions during hospitalization
- Nursing assessments
- Education topics
- Diagnostic tests ordered
- Ordersets utilized
Preliminary Unpublished Data
13. Benefits of Stochastic Gradient Boosting
Friedman JH. Stochastic gradient boosting. Computational Statistics and Data Analysis 2002;38(4):367-378.

Input and processing:
- Does not require data transformation
- Handles large numbers of categorical and continuous variables
- Has mechanisms for:
  - Feature selection
  - Managing missing values
  - Assessing the relationship of predictors to target
- Robust to data entry errors, outliers, and target misclassification

Output:
- High model accuracy
- Classification and regression
- Non-parametric application of logistic, L1, L2, or Huber-M loss functions
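TreeNet™ is a commercial implementation, but the same algorithm is available open source; scikit-learn's GradientBoostingClassifier exposes the key ingredients named on this slide (logistic loss by default, row subsampling for the "stochastic" part). A sketch on synthetic stand-in data; the variable names, effect sizes, and outcome model are invented for illustration, not taken from the study:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-ins for EMR-style predictors (mixed continuous/binary)
age = rng.normal(70, 15, n)
prior_admit = rng.integers(0, 2, n)
bun = rng.normal(25, 8, n)
X = np.column_stack([age, prior_admit, bun])

# Invented logistic outcome model yielding an unbalanced binary target
logit = -2.5 + 0.03 * (age - 70) + 1.2 * prior_admit + 0.05 * (bun - 25)
y = rng.random(n) < 1 / (1 + np.exp(-logit))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier(
    subsample=0.5,       # the "stochastic" in stochastic gradient boosting
    n_estimators=300, max_depth=2, learning_rate=0.05, random_state=0)
model.fit(X_tr, y_tr)    # default loss is the logistic deviance

auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

Note that no scaling, dummy coding, or imputation pipeline is required before fitting, which is the "does not require data transformation" point above.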
14. TreeNet™ Modeling with our Dataset
1 Parameters of ‘feature fit’
2 Parameters of ‘feature selection’
3 Elements of insight
4 Putting it all together
16. Feature selection – variable importance
Variable importance calculation:
The squared relative importance of a variable is the sum of the squared improvements (in squared-error risk) over all internal nodes for which that variable was chosen as the splitting variable.
Measurements are relative; it is customary to assign the largest a value of 100 and scale the other variables accordingly.
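The same impurity-based importance (summed improvements at the nodes where each variable splits, accumulated over the trees) with the customary 0-100 rescaling can be sketched with scikit-learn, assuming a synthetic dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: 8 predictors, only 3 carry signal
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Impurity-based importance: summed squared-error improvements over the
# internal nodes where each variable splits, averaged across the trees
imp = model.feature_importances_

# Customary scaling: largest variable set to 100, others relative to it
rel = 100 * imp / imp.max()
for j in np.argsort(rel)[::-1]:
    print(f"x{j}: {rel[j]:.1f}")
```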
17. Insight into the model
Illuminating the ‘black box’ with partial dependence
Preliminary Unpublished Data
18. Approach to feature selection
Domain 'neutral' vs. domain 'centric'

Domain neutral:
- Start with a subset based on univariate significance (i.e. P-value below a given level) or variance above a given threshold
- Forward and backward stepwise progression

Both:
- Know your data
- Univariate stats
- Application of variable importance
- Screening with batteries

Domain centric:
- Use all potential predictors
- Use knowledge of target and predictors to make decisions on inclusion (or rejection) of predictors
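The domain-neutral starting point (a univariate P-value screen) can be sketched as follows; the dataset, effect size, and significance level are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.normal(size=(n, p))
y = rng.random(n) < 0.3            # synthetic readmitted yes/no target
X[y, :3] += 0.8                    # make the first 3 predictors informative

# Domain-neutral screen: keep predictors whose univariate P-value
# (two-sample t-test, readmitted vs. not) falls below a given level
alpha = 0.01
keep = [j for j in range(p)
        if ttest_ind(X[y, j], X[~y, j]).pvalue < alpha]
print(keep)
```

Such a screen knows nothing about medicine; the domain-centric alternative would instead hand all candidates to the model and prune using clinical knowledge.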
19. Model Variability
Establishing AUC precision and accuracy
Variation: the model is fit via a sampling (i.e. stochastic) process.

Accuracy / precision:
S.E.M. = S.D. / sqrt(N)
Precision (95%) ≈ 4 × S.E.M.

With S.D. = 0.03:
N trials           10      30      300
S.E.M.             .0095   .0055   .0017
Precision (95%)    .038    .022    .007
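The SEM and precision figures in the table follow directly from the two formulas; a few lines reproduce them:

```python
import math

sd = 0.03                      # S.D. of test AUC across repeated fits (slide)
for n in (10, 30, 300):
    sem = sd / math.sqrt(n)
    precision95 = 4 * sem      # ~ width of the 95% interval (2 x 1.96 x SEM)
    print(f"N={n:3d}  SEM={sem:.4f}  precision(95%)={precision95:.3f}")
```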
20. Precision and predictor selection
[Figure: test ROC (AUC) by feature-selection step, from STEP_1 (0.531) to STEP_66 (0.703); average ROC min = 0.5057, median = 0.6738, mean = 0.6500, max = 0.7034]
AUC estimated using a single CV-10 run (= 10 trials) has SEM of .0095 and precision (95%) of .038.
Repeating CV-10 (using the CVR battery) 30 times (= 300 trials) yields SEM of .0017 and precision (95%) of .007.
This has profound implications for the dimensionality of model achievable without domain-knowledge input.
21. How much of a change in AUC is clinically relevant?
Gain curve complements ROC curve
Preliminary Unpublished Data
22. Useful batteries for feature selection
Methods of forward and backward selection
STEPWISE:
- Testing set to CV-10
- Select predictors 1-2 at a time
- Confirm with CVR battery

SHAVING
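A STEPWISE-style forward pass can be sketched generically: add predictors one at a time and keep each only when the CV-10 AUC gain clears an assumed noise floor. This mimics the battery's logic with scikit-learn rather than the Salford implementation, on invented data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in dataset; 4 of 10 predictors carry signal
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           random_state=0)

def cv10_auc(cols):
    m = GradientBoostingClassifier(n_estimators=100, random_state=0)
    return cross_val_score(m, X[:, cols], y, cv=10, scoring="roc_auc").mean()

# Forward pass: keep a candidate only when it beats the current model by
# more than an assumed noise floor (0.01 here, on the order of the SEM)
selected, best = [], 0.5
for j in range(X.shape[1]):
    auc = cv10_auc(selected + [j])
    if auc > best + 0.01:
        selected.append(j)
        best = auc
print(selected, round(best, 3))
```

SHAVING would run the same loop in reverse: start from the full set and drop the predictor whose removal costs the least, stopping when the AUC loss exceeds the noise floor.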
23. BUILDING A MODEL
This is a multi-step process.

Step 1: Run the model with all candidate predictors. Select the N highest-importance predictors (N = 2-3 × final size).
Step 2: Run batteries to assess parameters of feature 'fit'. Assess model (AUC) variability. Repeat as needed throughout the process.
Step 3: Use backward and forward selection to reduce the preliminary model to a core of 5-15 predictors.**
Step 4: Review predictors and use domain knowledge to eliminate redundant (dependent) predictors and consider predictors of known value.**
Step 5: Re-examine discarded predictors in smaller groups, using backward and forward selection.**

** Each change confirmed with CVR (30 reps); review partial dependence plots.
24. Initial runs
Information content and irreducible error
[Figure: train and test cross entropy vs. number of trees (0-1500) for two initial runs; top panel: run #287 (6 nodes), bottom panel: run #880 (2 nodes)]
Preliminary Unpublished Data
25. Sample Model
[Figures: gain curve and ROC curve for the feature selection set and the model training set]
Preliminary Unpublished Data
26. Sample partial dependence plots
The value of non-parametric regression
[Partial dependence plots: admissions within prior year, ICU days, anion gap, initial systolic BP, final BNP, BUN-creatinine ratio]
Preliminary Unpublished Data
27. Prospective application
Additional heart failure discharges can be scored against the model
[Figures: gain curve and ROC curve for prospectively scored discharges]
Preliminary Unpublished Data
Causes for performance shift:
- Overfitting in the original model
- Concomitant intervention programs are altering patient risk of readmission
28. Non-influential candidate predictors
Models favor continuous over binary ‘dummy’ variables
Diagnoses and QualNet Condition Categories
Medications and Therapeutic Categories
Diagnostic Tests
Ordersets Submitted
Preliminary Unpublished Data
29. Lessons learned
TreeNet™ (stochastic gradient boosting) is extremely well suited to the data structure of EMR data.
Insight into the dataset is a rich feature (over and above prediction performance).
Model performance variance is important in feature selection.
- A consequence of the limited information content in our dataset.
Batteries are useful.
- PARTITION – Variability assessment
- CVR – Model assessment
- STEPWISE – Forward selection
- SHAVING – Backward selection
There is great value in learning on a non-trivial dataset within a
familiar domain.
30. Next steps ……
Explore options to manage model variability
and increase dimensionality of predictor set.
Extend analysis of predictor interactions.
Develop mechanism of ‘point-of-care’ patient
scoring.
Apply techniques to new problems and datasets.
Stochastic gradient boosting is the algorithm that underlies the TreeNet application. Discussing it at a Salford conference is like bringing coal to Newcastle, so I won't embarrass myself. It exhibits several characteristics that are attractive for EMR (and most other) datasets.