CONTENTS
Business Problem
Methodology & Process
How does the model get Deployed - 30K feet view
Where else will the lender use the models?
Do other industries use this framework too?
References for reading materials
Intended for Knowledge Sharing only
BUSINESS PROBLEM
Risk-based Approval/Pricing Framework
How does the business see it?
1. What are the chances of non-repayment?
2. If it happens, how much money will go bad?
3. How much will I ultimately recover if I repossess and sell off the vehicle?
Note:
* Non-repayment is defined as payments delayed by over 180 days since the due date.
BUSINESS PROBLEM
Risk-based Approval/Pricing Framework
How do statisticians see it?
[Figure: three panels, numbered 1 to 3.]
BUSINESS PROBLEM
Risk-based Approval/Pricing Framework
How do analysts see it?
1. Probability of non-repayment (PD)
2. Estimated $ of non-repayment (EAD)
3. Loss post recovery (LGD)
HOW IS IT DONE?
The first step is to convert the business problem into an analytical framework (label and inputs), followed by:
Data Preparation
● Hypotheses - Important drivers and expected relationships
● Data preparation - Missing value and capping treatment
Dimensionality Reduction
● Bivariate - Type and strength of the relationship
● Multivariate - VIF & CI (similar to PCA)
Modeling & Analysis
● Model building on the development sample
  - Identification of statistically significant drivers, overall fit and accuracy
Validation
● Model rebuilding on the validation sample
  - Stability of drivers, fit of model and accuracy
Recommendations & Implementation Strategy
● Framing of actionable recommendations and impact analysis
HOWEVER, IT SHOULD BE PRECEDED BY SEGMENTATION
Customers need to be grouped into homogeneous buckets to normalize for inherent variation between different types of customers, products, etc.
[Figure: segmentation grid. Rows: Loan Term (1 year, 2 year), each split into Credit Score Bands (Least, Mid and High Score Range). Columns: Low End Models, Mid Range Models, Luxury Brands. Cells are grouped into segments numbered 1 to 5.]
TRANSLATE INTO ANALYTICAL FRAMEWORK
A model is a mathematical relationship between a “Target/Label” variable and the “Predictor/Input” variables. Here, “Non-repayment” is the Target/Label and the application information supplies the Predictor/Input variables:
Non-repayment = f(application data such as Credit Score, % Monthly Payment to Income, etc.)
We build models on a historical sample, i.e., one where we have both the application data and what happened to that application later, over the loan term.
Predictors/Input Variables (customer info at the time of application)
Appl_ID  Crd_sc  %Pymt_Inc
1        750     10%
2        500     70%
3        650     25%

Target/Labels (non-repayment info over the loan term)
Appl_ID  NP_Flag  When
1        No
2        Yes      5th Month
3        No       -

Modeling Data = Predictors/Input Variables + Target/Labels
Appl_ID  Crd_sc  %Pymt_Inc  NP_Flag  When
1        750     10%        No
2        500     70%        Yes      5th Month
3        650     25%        No       -
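A minimal sketch of how the two tables could be joined into modeling data, using the three sample applications shown on this slide. The dictionary layout, the 0/1 encoding of NP_Flag and the function name `build_modeling_data` are illustrative assumptions, not part of the original process.

```python
# Predictors captured at application time, keyed by Appl_ID (sample values from the slide).
predictors = {
    1: {"Crd_sc": 750, "Pymt_Inc": 0.10},
    2: {"Crd_sc": 500, "Pymt_Inc": 0.70},
    3: {"Crd_sc": 650, "Pymt_Inc": 0.25},
}

# Outcome observed later, over the loan term, keyed by Appl_ID.
labels = {
    1: {"NP_Flag": "No", "When": None},
    2: {"NP_Flag": "Yes", "When": "5th Month"},
    3: {"NP_Flag": "No", "When": None},
}


def build_modeling_data(predictors, labels):
    """Join predictors and labels on Appl_ID, keeping applications that have both."""
    rows = []
    for appl_id in sorted(predictors):
        if appl_id not in labels:
            continue  # no observed outcome yet, so the record cannot be used for modeling
        row = {"Appl_ID": appl_id}
        row.update(predictors[appl_id])
        # Assumed encoding: 1 = non-repayment, 0 = repaid.
        row["NP_Flag"] = 1 if labels[appl_id]["NP_Flag"] == "Yes" else 0
        row["When"] = labels[appl_id]["When"]
        rows.append(row)
    return rows


modeling_data = build_modeling_data(predictors, labels)
for row in modeling_data:
    print(row)
```

Each output row carries both the application-time inputs and the later outcome, which is the shape a model needs for training.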
DATA CREATION - PREDICTOR VARIABLES & HYPOTHESES
Listed as DATA, then TYPE, then VARIABLES: EXPECTED RELATIONSHIP.
BUREAU DATA
  Absolute values
    Credit Score: -ve
    Payment to Income Ratio: +ve
    Debt to Income Ratio: +ve
    #Inquiries in last qtr, 12 months: +ve
    Total Outstanding Loan: +ve
    Bankruptcy, Non-repayments, Charge offs, etc.: +ve
  Deviations in Slope and Level
    Trend, Shocks, etc.: -ve/+ve
LOAN DETAILS
  Absolute values
    Total Loan Requested: Depends
    Term of the loan: -ve/+ve
    Make/Model/Model Year of the Car: Depends on market demand for the Make/Model
    Past relationship with the Lender: -ve
    New/Used Car: New = -ve
DEMOGRAPHIC DETAILS
  Absolute values
    Home Owner/Renter, #Dependents, Gender, Marital Status, Age, Occupation, Education, Profession: Depends on the variable
MACROECONOMIC DATA
  Absolute values
    GDP, Household Savings Ratio, Fuel Prices, Unemployment Rate, Interest Rates, etc.: Depends on the variable
  Deviation
    Trend, Shocks, etc.: Depends on the variable
GEO DATA
  Absolute values
    City, State, Region Cluster, Local Competition Data, Dealership level factors, etc.: Depends on the variable
TRANSACTIONS DATA
  Absolute values
    Monthly Payments, #Payments made, #Non-repayments, Time to CO, Amount of Non-repayment, Recovery Rate, etc.: Depends on the variable
  Deviation
    Trend, Shocks, etc.: Depends on the variable
DATA PREPARATION
CAPPING & MISSING VALUE TREATMENT
Capping treatment is necessary to remove the effect of extreme or nonsensical values that are very different from the rest of the population.
No.  Pyoffflg  Prin0105  Loanamt  Term  Fixed  Agnsttr  Bbctrad  Nummortt  Rvoptbal  Numminq  Numminq3
1    0         2324.9    19900    360   1      21       282      1         282       0        0
2    0         3796.5    22100    240   0      6        6911     1         33978     1        1
3    1         12523.2   42000    360   1      1        36350    .         36732     1        1
4    0         5190.9    21760    349   1      42       885      1         911       0        0
5    1         53.6      18000    360   1      5        8851     1         9506      0        0
6    0         1256.9    15500    360   .      13       409      1         760       0        0
7    0         4403.3    25150    900   1      3        21417    5         23579     3        1
8    0         3137.2    17800    240   1      4        4528     2         5967      1        0
9    0         4256.5    9999999  360   1      9        18179    47        130683    4        1
10   0         6442.4    31200    360   1      34       33177    1         0         2        0
Callouts: Missing observations (shown as "."); Unrealistic values (for example, a Loanamt of 9999999 or a Term of 900).
Missing value treatment is the imputation of missing values for certain variables, and it is mandatory: if a missing value is left untreated, the entire record is excluded from modeling.
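A minimal sketch of both treatments, assuming median imputation and percentile-based capping. The deck does not say which imputation rule or cut-offs were used, so the median, the default 1st/99th percentiles and the function names are assumptions. The Loanamt and Nummortt values are the ones read from the sample records, with `None` standing in for a missing ".".

```python
import math
import statistics


def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile (pct in 0..100) of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def impute_and_cap(values, lower_pct=1, upper_pct=99):
    """Replace missing values (None) with the median, then cap at the percentile bounds."""
    observed = sorted(v for v in values if v is not None)
    median = statistics.median(observed)
    low = nearest_rank(observed, lower_pct)
    high = nearest_rank(observed, upper_pct)
    filled = [median if v is None else v for v in values]
    return [min(max(v, low), high) for v in filled]


# Loanamt contains an unrealistic 9999999; a tighter cap is used because the sample is tiny.
loanamt = [19900, 22100, 42000, 21760, 18000, 15500, 25150, 17800, 9999999, 31200]
print(impute_and_cap(loanamt, lower_pct=10, upper_pct=90))

# Nummortt has one missing observation (record 3).
nummortt = [1, 1, None, 1, 1, 1, 5, 2, 47, 1]
print(impute_and_cap(nummortt))
```

With only ten records the 10th/90th percentiles are used for Loanamt so that the cap actually bites; on a full development sample the cut-offs would be chosen by inspecting each variable's distribution.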
DIMENSIONALITY REDUCTION
BIVARIATE ANALYSIS
Bivariate analysis explores the nature and degree of the relationship between each independent variable and the dependent variable.
• Rank Plots: Check whether a predictor variable correlates with the Target variable.
Steps:
• Sort the population by predictor variable values
• Split into groups with an equal number of observations, generally ten groups (deciles)
• Get the average of the Target variable in each group
• Check if there is a trend in the average value of the Target variable from the top group to the bottom
[Figure: two rank plots of Avg Target against Predictor Deciles. One is annotated "No relationship"; the other is annotated "Dummy = (predictor value <= 2)".]
Besides finding related predictors and suggesting predictor transformations, it also helps in dimensionality reduction.
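The rank-plot steps can be sketched as follows. The function names and the synthetic data are illustrative assumptions; the grouping simply cuts the sorted population into near-equal slices.

```python
def decile_averages(predictor, target, n_groups=10):
    """Sort by predictor, split into n_groups near-equal groups, average the target per group."""
    pairs = sorted(zip(predictor, target), key=lambda p: p[0])
    n = len(pairs)
    averages = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if group:  # skip empty groups when there are fewer rows than groups
            averages.append(sum(t for _, t in group) / len(group))
    return averages


def has_trend(averages):
    """True when the group averages never reverse direction (monotonic up or down)."""
    steps = [b - a for a, b in zip(averages, averages[1:])]
    return all(s >= 0 for s in steps) or all(s <= 0 for s in steps)


# Synthetic example: the target is 1 only for the lowest 20 predictor values.
predictor = list(range(100))
target = [1 if x < 20 else 0 for x in predictor]
averages = decile_averages(predictor, target)
print(averages)
print(has_trend(averages))
```

A step pattern like this one, where only the lowest groups differ from the rest, is the kind of shape that suggests replacing the raw predictor with a dummy variable.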
DIMENSIONALITY REDUCTION
MULTIVARIATE ANALYSIS
The two metrics predominantly used are the Variance Inflation Factor (VIF) and the Condition Index (CI).
Variance Inflation Factor (VIF)
VIF is obtained by regressing each independent variable, say X, on the remaining independent variables (say x1 and x2) and checking how much of the variance of X is explained by these variables.
-> Cut-offs used vary from 2 to 10
Condition Index (CI)
The Condition Index is the square root of the ratio of the highest eigenvalue (λmax) to the individual eigenvalue (λ).
-> Cut-offs used vary from 13 to 30
Very similar to Principal Component Analysis (PCA)
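Both metrics reduce to short formulas, sketched below. The VIF expression 1 / (1 - R-square) is the standard definition rather than something stated on the slide, and the function names are assumptions. The three eigenvalues are the first three from the sample SAS output in this deck.

```python
import math


def vif_from_r_squared(r_squared):
    """VIF for one predictor, given the R-square from regressing it on the other predictors."""
    return 1.0 / (1.0 - r_squared)


def condition_indices(eigenvalues):
    """Condition index per eigenvalue: sqrt(largest eigenvalue / that eigenvalue)."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / ev) for ev in eigenvalues]


# A predictor that is 50% explained by the others has a VIF of 2.
print(vif_from_r_squared(0.5))

eigenvalues = [8.3631, 1.01345, 0.96895]
print([round(ci, 2) for ci in condition_indices(eigenvalues)])
```

The rounded condition indices agree with the 1, 2.87264 and 2.93787 reported in the sample output for the same eigenvalues.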
GENERALIZED LINEAR MODELS
SAMPLE VIF/CI OUTPUT
The REG Procedure
Model: MODEL1
Dependent Variable: NP_Flag

Number of Observations Read    40162
Number of Observations Used    40162

Analysis of Variance
Source             DF       Sum of Squares    Mean Square    F Value    Pr > F
Model              12       610.91533         50.90961       219.02     <.0001
Error              40149    9332.36401        0.23244
Corrected Total    40161    9943.27934

Root MSE          0.48212     R-Square    0.0614
Dependent Mean    0.5492      Adj R-Sq    0.0612
Coeff Var         87.78642

Parameter Estimates
Variable                DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Variance Inflation
Intercept               1     1.24953               0.20693           6.04       <.0001      0
Credit_Score            1     -0.000216             0.00028377        -0.76      0.4465      1.0205
%Down_Pymt_to_Loan      1     -0.1166               0.0117            -9.96      <.0001      1.09417
%Mnthly_Pymt_to_Loan    1     0.01966               0.00517           -3.8       0.0001      1.17587

Collinearity Diagnostics
                                           Proportion of Variation
Number    Eigenvalue    Condition Index    Intercept     Credit_Score    %Down_Pymt_to_Loan    %Mnthly_Pymt_to_Loan
1         8.3631        1                  0.00000188    0.00000202      0.00002708            0.00057815
2         1.01345       2.87264            8.65E-09      8.73E-09        1.04E-07              5.68E-06
3         0.96895       2.93787            2.42E-11      5.60E-14        1.68E-09              0.0000019
8         0.22138       6.14626            0.00000754    0.00000817      0.00009252            0.00396
9         0.20341       6.41212            0.00001611    0.00001745      0.00020511            0.01911
10        0.05087       12.82208           0.00000322    0.00000279      0.00011988            0.26143
11        0.02578       18.01153           0.00082432    0.00088072      0.00992               0.68574
12        0.00137       78.10783           0.01375       0.01859         0.96941               0.02085
13        0.00007104    343.097            0.98539       0.98048         0.02008               0.00000173
MODELING DETAILS
1. What are the chances of non-repayment?
   Probability of Non-repayment (PD)
2. If it happens, how much money will go bad?
   Predict the $ amount at risk of Non-repayment (EAD)
3. How much will I ultimately recover if I repossess and sell off the vehicle?
   Estimate the % of the amount at risk that cannot be recovered (LGD)
MODELING DETAILS
1. Probability of Non-repayment (PD): Logistic Model
2. Predict the $ amount at risk of Non-repayment (EAD): OLS Model
3. Predict the % of the amount at risk that cannot be recovered (LGD): Average by Risk Deciles
LOGISTIC REGRESSION
What is a Logistic Model?
-> Predicts the log odds of the event, where odds = P(event)/P(non-event)
-> The predictive model is a mathematical relationship between the predictors and the Target
log(odds) = α + β1X1 + β2X2
SAS procedure: PROC LOGISTIC (with various link functions)
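To use the fitted equation for scoring, the log odds are converted back to a probability with the inverse logit. The sketch below assumes the default logit link; the coefficient values in the example are hypothetical and chosen only to show the arithmetic.

```python
import math


def predict_probability(alpha, betas, xs):
    """Probability of the event from log(odds) = alpha + sum(beta_i * x_i)."""
    log_odds = alpha + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients for two predictors: credit score and payment-to-income ratio.
alpha = 0.0
betas = [-0.01, 2.0]

strong_applicant = predict_probability(alpha, betas, [750, 0.10])
weak_applicant = predict_probability(alpha, betas, [500, 0.70])
print(strong_applicant, weak_applicant)
```

With a negative coefficient on credit score and a positive one on payment-to-income, the second applicant receives the higher predicted probability of non-repayment, matching the expected signs in the hypotheses.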
HOW TO FIND IF A METHOD WORKS?
For Logistic Models, the following metrics are used as performance diagnostics:
• Concordance/Discordance: Overall indicator of the model's prediction accuracy
  • Pair every “bad” (non-repayment) customer with every “good” customer
  • Check the % of pairs where the “bad” customer is given a higher probability than the “good” customer
• Rank Order: A similar test to the above, but in a more structured format
  Steps:
  • Sorting: Sort the population by predicted probability
  • Deciling: Bucket them into ten groups, each holding 10% of the population in sorted order
  • Check the % of non-repayment customers in each decile
  • Capturing: Ideally the % of “bad” customers should be highest in the top deciles and lowest in the bottom deciles. The top deciles should capture most of the non-repayment customers.
• Gains Chart: Graphical representation of the capturing achieved by the model, compared against random bucketing.
• Akaike Information Criterion (AIC): Helps in selecting the most “parsimonious” regression model, i.e., maximum information capture with the least number of predictors.
These come in addition to the usual checks on signs and statistical significance, and on whether the model also holds in the validation samples.
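A minimal sketch of the concordance check, comparing every “bad”/“good” pair directly. The function name and the tiny data set are assumptions, and the double loop is only practical for small samples. The `c` and Somers' D lines use the usual relationships c = (concordant + 0.5 × tied) / pairs and D = (concordant - discordant) / pairs.

```python
def concordance(probabilities, outcomes):
    """Compare every (bad, good) pair of predicted probabilities; outcomes are 1 = bad, 0 = good."""
    bad = [p for p, y in zip(probabilities, outcomes) if y == 1]
    good = [p for p, y in zip(probabilities, outcomes) if y == 0]
    pairs = len(bad) * len(good)
    if pairs == 0:
        raise ValueError("need at least one bad and one good observation")
    concordant = sum(1 for b in bad for g in good if b > g)
    discordant = sum(1 for b in bad for g in good if b < g)
    tied = pairs - concordant - discordant
    return {
        "percent_concordant": 100.0 * concordant / pairs,
        "percent_discordant": 100.0 * discordant / pairs,
        "percent_tied": 100.0 * tied / pairs,
        "c": (concordant + 0.5 * tied) / pairs,
        "somers_d": (concordant - discordant) / pairs,
    }


# Illustrative scores: two bad customers and three good ones, giving six pairs.
result = concordance([0.9, 0.8, 0.3, 0.3, 0.1], [1, 0, 1, 0, 0])
print(result)
```

In this example four of the six pairs are concordant, one is discordant and one is tied, so c is 0.75 and Somers' D is 0.5.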
SAMPLE MODEL OUTPUT
Type 3 Analysis of Effects
Effect                  DF    Wald Chi-Square    Pr > ChiSq
APPLICATION_PRIM_CB_    2     14.5230            0.0007
%Down_Pymt_to_Loan      2     126.6605           <.0001
%Mnthly_Pymt_to_Loan    2     83.5880            <.0001

Analysis of Maximum Likelihood Estimates
Parameter               DF    Development Model Estimate    Validation Model Estimate    Standard Error    Wald Chi-Square    Pr > ChiSq
Intercept               1     1.1321                        -0.4085                      0.8909            0.2102             0.6466
APPLICATION_PRIM_CB_    1     -0.00349                      -0.00220                     0.00122           3.2494             0.0715
%Down_Pymt_to_Loan      1     -0.3934                       -0.2839                      0.0485            34.2834            <.0001
%Mnthly_Pymt_to_Loan    1     0.1206                        -0.0900                      0.0221            16.5920            <.0001

Odds Ratio Estimates
Effect                  Point Estimate    95% Wald Confidence Limits
APPLICATION_PRIM_CB_    0.998             0.995    1.000
%Down_Pymt_to_Loan      0.753             0.685    0.828
%Mnthly_Pymt_to_Loan    0.914             0.875    0.954

Percent Concordant    65.9    Somers' D    0.338
Percent Discordant    32.1    Gamma        0.345
Percent Tied          2.0     Tau-a        0.074
Pairs                 1806529536    c      0.669

The higher the percent concordant, the better the model.
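The odds ratio figures follow from the parameter estimates: the point estimate is exp(estimate) and the 95% Wald limits are exp(estimate ± 1.96 × standard error). A short sketch, using the %Down_Pymt_to_Loan validation estimate (-0.2839) and standard error (0.0485) from the sample output; the function name is an assumption.

```python
import math


def odds_ratio_with_ci(estimate, std_error, z=1.96):
    """Odds ratio and Wald confidence limits from a logistic coefficient and its standard error."""
    point = math.exp(estimate)
    lower = math.exp(estimate - z * std_error)
    upper = math.exp(estimate + z * std_error)
    return point, lower, upper


point, lower, upper = odds_ratio_with_ci(-0.2839, 0.0485)
print(round(point, 3), round(lower, 3), round(upper, 3))
```

The result lines up with the 0.753 point estimate and the 0.685 to 0.828 limits reported for that effect.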
SAMPLE GAINS CHART
[Figure: gains chart of Responders captured against Population (%), comparing Model capturing with Random capturing.]
The higher the capturing in the initial deciles, the better the model performance.
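The points behind a gains chart can be computed as below: rank by predicted probability, then accumulate the share of non-repayment cases captured after each decile. The function name and the ten illustrative records are assumptions. The random baseline is simply the cumulative population percentage.

```python
def cumulative_gains(probabilities, outcomes, n_groups=10):
    """Cumulative % of bad cases captured after each group, highest scores first."""
    ranked = sorted(zip(probabilities, outcomes), key=lambda p: p[0], reverse=True)
    total_bad = sum(y for _, y in ranked)
    if total_bad == 0:
        raise ValueError("no non-repayment cases to capture")
    n = len(ranked)
    gains = []
    captured = 0
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        captured += sum(y for _, y in group)
        gains.append(100.0 * captured / total_bad)
    return gains


# Illustrative scores for ten customers; 1 marks a non-repayment.
probabilities = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

model_gains = cumulative_gains(probabilities, outcomes)
random_gains = [10.0 * (g + 1) for g in range(10)]
print(model_gains)
print(random_gains)
```

Here the model captures all non-repayment cases within the top four deciles, whereas random bucketing would capture only 40% by that point.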
OLS MODELS
What is a Linear Model?
-> Predicts the value of the Target variable
-> The predictive model is a mathematical relationship between the predictors and the Target
y = α + β1X1 + β2X2
* These models are developed only on the “bad” population, since including the “good” population would skew the model.
SAS procedure: PROC REG
HOW TO FIND IF A METHOD WORKS?
For Linear Models, the following metrics are used as performance diagnostics:
• R-square: Tells how much of the variance in the “Target” variable is captured by the model.
• Error rate (%): Tells how large the error is relative to the actual values of the Target variable.
  Error rate (%) = average of square(actual – predicted) / average of actuals
• Rank Order: Checks if the predicted values correlate with the actual values.
  Steps:
  • Sorting: Sort the population by predicted values
  • Deciling: Bucket them into ten groups, each holding 10% of the population in sorted order
  • Check the average predicted value and the average actual value in each decile
  • Check if both averages decrease gradually from the top group to the bottom
• Akaike Information Criterion (AIC): Helps in selecting the most “parsimonious” regression model, i.e., maximum information capture with the least number of predictors.
These come in addition to the usual checks on signs and statistical significance, and on whether the model also holds in the validation samples.
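A minimal sketch of the first two diagnostics. R-square uses the standard 1 - SSE/SST form, and the error rate is implemented exactly as written on the slide (mean squared error divided by the mean of the actuals). The function names and the four data points are illustrative assumptions.

```python
def r_square(actual, predicted):
    """Share of the variance in the actuals that the predictions capture."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


def error_rate(actual, predicted):
    """Error rate as defined on the slide: average squared error over average of actuals."""
    mean_squared_error = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return mean_squared_error / (sum(actual) / len(actual))


actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 4.0, 5.0, 8.0]
print(r_square(actual, predicted))
print(error_rate(actual, predicted))
```

For these points the residual sum of squares is 2 against a total of 20, so R-square is 0.9, and the slide's error rate works out to 0.5 / 5 = 0.1.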
SAMPLE MODEL OUTPUT
The REG Procedure
Model: MODEL1
Dependent Variable: NP_Flag

Number of Observations Read    40162
Number of Observations Used    40162

Analysis of Variance
Source             DF       Sum of Squares    Mean Square    F Value    Pr > F
Model              12       610.91533         50.90961       219.02     <.0001
Error              40149    9332.36401        0.23244
Corrected Total    40161    9943.27934

Root MSE          0.48212     R-Square    0.0614
Dependent Mean    0.5492      Adj R-Sq    0.0612
Coeff Var         87.78642

Parameter Estimates
Variable                DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    Variance Inflation
Intercept               1     1.24953               0.20693           6.04       <.0001      0
Credit_Score            1     -0.000216             0.00028377        -0.76      0.4465      1.0205
%Down_Pymt_to_Loan      1     -0.1166               0.0117            -9.96      <.0001      1.09417
%Mnthly_Pymt_to_Loan    1     0.01966               0.00517           -3.8       0.0001      1.17587
GENERALIZED LINEAR MODELS
SAMPLE RANK ORDERING FOR LINEAR MODELS
[Figure: rank-ordering chart of Avg Value in a Decile against Decile (1 to 10), with two series: Avg Predicted in this Decile and Avg Actual in this Decile.]
LOSS POST RECOVERY - NON-RECOVERY RATE (%)
SAMPLE CALCULATION BY DECILES
Deciles of $ Model    Avg Non Recovery Rate (%)
1                     50%
2                     47%
3                     40%
4                     38%
5                     37%
6                     27%
7                     15%
8                     12%
9                     10%
10                    5%
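Once the average rates are tabulated, applying them is a lookup by decile. The sketch below uses the sample rates from this slide; the function names and the example amount are assumptions.

```python
# Average non-recovery rate by decile of the $ model (sample values from the slide).
AVG_NON_RECOVERY_RATE = {
    1: 0.50, 2: 0.47, 3: 0.40, 4: 0.38, 5: 0.37,
    6: 0.27, 7: 0.15, 8: 0.12, 9: 0.10, 10: 0.05,
}


def lgd_for_decile(decile):
    """Look up the average non-recovery rate (LGD) for a decile of the $ model."""
    if decile not in AVG_NON_RECOVERY_RATE:
        raise ValueError("decile must be an integer from 1 to 10")
    return AVG_NON_RECOVERY_RATE[decile]


def unrecovered_amount(amount_at_risk, decile):
    """Amount expected to remain unrecovered for a given amount at risk and decile."""
    return amount_at_risk * lgd_for_decile(decile)


print(unrecovered_amount(10000, 3))
```

An account in decile 3 with 10,000 at risk would be expected to leave about 4,000 unrecovered under these sample rates.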
RECOMMENDATIONS
Based on simulations and business needs, score buckets are created with ranges for High, Mid and Low Risk.
HIGH RISK
Decline, or price at a premium to recover the maximum amount before non-repayment.
MID RISK
Approve, but charge a high interest rate at the beginning, which can then be negotiated down to a floor value.
LOW RISK
Approve, and proactively offer interest rate reductions and cross-sell efforts with the aim of making them come back.
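A minimal sketch of how a predicted probability could be mapped to these buckets. The cut-off values (0.05 and 0.20) are purely hypothetical; in practice they would come from the simulations and business needs mentioned above.

```python
RECOMMENDED_ACTION = {
    "HIGH RISK": "Decline or price at a premium",
    "MID RISK": "Approve at a high initial rate, negotiable down to a floor",
    "LOW RISK": "Approve and proactively reduce rate / cross-sell",
}


def risk_bucket(pd_score, low_cutoff=0.05, high_cutoff=0.20):
    """Assign a risk bucket from a predicted probability of non-repayment (hypothetical cut-offs)."""
    if not 0.0 <= pd_score <= 1.0:
        raise ValueError("pd_score must be a probability between 0 and 1")
    if pd_score >= high_cutoff:
        return "HIGH RISK"
    if pd_score >= low_cutoff:
        return "MID RISK"
    return "LOW RISK"


for score in (0.02, 0.10, 0.35):
    bucket = risk_bucket(score)
    print(score, bucket, RECOMMENDED_ACTION[bucket])
```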
DEPLOYMENT AT A 30K FEET LEVEL
Typical steps:
• At the dealership level: negative list verification from Driving License details
• The finance person at the dealer then inputs all PII, including the Social Security number, into the “Approval” system; the engine runs the model with the Bureau data and other model details
• The system recommends a decision (yes/no) and a guidance price, which can then be negotiated with the Credit executive based on the scenarios and the sales/risk guidance he has.
OTHER APPLICATIONS OF THE MODEL
Some other areas within the institution where the model outputs are leveraged:
• Portfolio P&L estimation
  Net Income from this business = Sum (all Monthly Payments) - (Probability of Non-repayment * Estimated $ of Non-repayment * Loss Post Recovery)
  * In the accounting world,
  1. Monthly Payments figures are “discounted” for inflation over the loan time window
  2. the net income is then compared against the returns the firm would have earned had it invested the same amount at US Government Treasury rates, to justify running this business
• Regulatory risk reporting - BASEL norms
• Customer bucketing for Upselling/Cross selling/Retention programs.
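The P&L formula combines the three model outputs into an expected loss. A minimal sketch for a single loan follows; the function names, the example figures and the simple per-period discounting are assumptions made for illustration.

```python
def expected_loss(pd, ead, lgd):
    """Expected loss = probability of non-repayment * $ at risk * share not recovered."""
    return pd * ead * lgd


def present_value(monthly_payments, monthly_rate):
    """Discount each payment back to today; the first payment is one period away."""
    return sum(p / (1.0 + monthly_rate) ** (t + 1) for t, p in enumerate(monthly_payments))


def net_income(monthly_payments, pd, ead, lgd, monthly_rate=0.0):
    """Net income for one loan: (discounted) payments minus expected loss."""
    return present_value(monthly_payments, monthly_rate) - expected_loss(pd, ead, lgd)


# Hypothetical loan: 36 payments of 500, PD 10%, EAD 10,000, LGD 40%.
payments = [500.0] * 36
print(net_income(payments, pd=0.10, ead=10000.0, lgd=0.40))
```

With no discounting, the payments total 18,000 and the expected loss is 400, giving a net income of 17,600; passing a positive `monthly_rate` applies the discounting described in the note.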
SIMILAR FRAMEWORK IN OTHER INDUSTRIES
A similar framework is used in other industries to solve various business problems:
• Marketing Campaigns: e.g., find out which customers are more likely to respond to campaigns and, if they do, how much $ they would spend with us
• How many will use Friend finder on Facebook and, if they do, how many invites will they send?
• How many will see the promoted news feed? How many times will they re-share it?
• Loyalty Models (ecommerce): e.g., will a customer get engaged (repeat purchases) and, if so, how much $ will they spend with us?
• Attrition Models (Telecom): e.g., are we going to lose a customer and, if yes, how much revenue impact will it have?
GOOD INFO ON LINEAR & LOGISTIC REGRESSION AT…
Linear Regression
http://faculty.chass.ncsu.edu/garson/PA765/regress.htm
Logistic Regression
http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm