Risk Based Loan Approval Framework

Director, Analytics & A/B Testing at Visa. An Analytics Evangelist, Thought Leader & Keynote Speaker at Visa
Nov 14, 2013

  1. RISK BASED APPROVAL FRAMEWORK - Auto Loans, Dec 2013
  2. CONTENTS
     • Business Problem
     • Methodology & Process
     • How does the model get deployed? (30K feet view)
     • Where else will the lender use the models?
     • Do other industries use this framework too?
     • References for reading materials
     Intended for Knowledge Sharing only
  3. BUSINESS PROBLEM: Risk based Approval/Pricing Framework
  4. BUSINESS PROBLEM: Risk based Approval/Pricing Framework. How does the business see it?
     1. What are the chances of non-repayment?
     2. If it happens, how much money will go bad?
     3. How much will I ultimately recover if I repossess and sell off the vehicle?
     Note: Non-repayment is defined as payments delayed by over 180 days since the due date.
  5. BUSINESS PROBLEM: Risk based Approval/Pricing Framework. How do statisticians see it?
     [Items 1 to 3 on this slide were figures and are not captured in the transcript]
  6. BUSINESS PROBLEM: Risk based Approval/Pricing Framework. How do analysts see it?
     1. Probability of non-repayment (PD)
     2. Estimated $ of non-repayment (EAD)
     3. Loss Post Recovery (LGD)
  7. HOW IS IT DONE?
     The first step is to convert the business problem into an analytical framework (Label & Inputs), followed by:
     Stages: Data Preparation; Dimensionality Reduction; Modeling & Analysis; Validation; Recommendations & Implementation Strategy
     ● Hypotheses - important drivers and expected relationship
     ● Data preparation - Missing & Capping Treatment
     ● Bivariate - type and strength of the relationship
     ● Multivariate - VIF & CI (similar to PCA)
     ● Model building on Development Sample - identification of statistically significant drivers, overall fit & accuracy
     ● Model rebuilding on Validation Sample - stability of drivers, fit of model & accuracy
     ● Framing of actionable recommendations and impact analysis
  8. HOWEVER, IT SHOULD BE PRECEDED BY SEGMENTATION
     Customers need to be bucketed into homogeneous buckets, to normalize for inherent variation between various types of customers, products, etc.
     [Segmentation grid: rows are Loan Term (1 year, 2 year) crossed with Credit Score Bands (Least, Mid, High Score Range); columns are Low End Models, Mid Range Models, Luxury Brands; cells carry segment numbers 1 to 5, whose positions are not recoverable from the transcript]
  9. TRANSLATE INTO ANALYTICAL FRAMEWORK
     A model is a mathematical relationship between a "Target/Label" variable and the "Predictor/Input" variables. Here "Non-repayment" is the "Target/Label" and the application information supplies the "Predictors/Input Variables":
     Non-repayment = f {application data like Credit Score, %Monthly Payment to Income, etc.}
     We build models on a historical sample, i.e., one where we have both the application data and what happened with that application later on over the loan term.

     Predictors/Input Variables (customer info at the time of application)
     Appl_ID  Crd_sc  %Pymt_Inc
     1        750     10%
     2        500     70%
     3        650     25%

     Target/Labels (non-repayment info over loan term)
     Appl_ID  NP_Flag  When
     1        No       -
     2        Yes      5th Month
     3        No       -

     Modeling Data (Predictors/Input Variables + Target/Labels)
     Appl_ID  Crd_sc  %Pymt_Inc  NP_Flag  When
     1        750     10%        No       -
     2        500     70%        Yes      5th Month
     3        650     25%        No       -
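The join described on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the deck's own code (the deck works in SAS): the dictionaries, the `build_modeling_data` helper and the 0/1 `target` column are assumptions made for the example, while the three sample rows come from the slide.

```python
# Hypothetical in-memory tables mirroring the slide's three sample applications.
predictors = {
    1: {"Crd_sc": 750, "Pymt_Inc": 0.10},
    2: {"Crd_sc": 500, "Pymt_Inc": 0.70},
    3: {"Crd_sc": 650, "Pymt_Inc": 0.25},
}
labels = {
    1: {"NP_Flag": "No", "When": None},
    2: {"NP_Flag": "Yes", "When": "5th Month"},
    3: {"NP_Flag": "No", "When": None},
}


def build_modeling_data(predictors, labels):
    """Join predictors and labels on Appl_ID; keep only IDs present in both."""
    rows = []
    for appl_id in sorted(predictors.keys() & labels.keys()):
        row = {"Appl_ID": appl_id}
        row.update(predictors[appl_id])
        row.update(labels[appl_id])
        # Binary target for modeling (assumed encoding): 1 = non-repayment.
        row["target"] = 1 if row["NP_Flag"] == "Yes" else 0
        rows.append(row)
    return rows


modeling_data = build_modeling_data(predictors, labels)
```

Applications that lack either side of the join are dropped here, which matches the idea that a model can only be built where both the application data and the later outcome are known.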
  10. DATA CREATION - PREDICTOR VARIABLES & HYPOTHESES
      Each entry reads: VARIABLES: EXPECTED RELATIONSHIP, grouped by source and DATA TYPE.
      BUREAU DATA
      • Absolute values
        • Credit Score: -ve
        • Payment to Income Ratio: +ve
        • Debt to Income Ratio: +ve
        • #Inquiries in last qtr, 12 months: +ve
        • Total Outstanding Loan: +ve
        • Bankruptcy, Non-repayments, Charge offs, etc.: +ve
      • Deviations in Slope and Level
        • Trend, Shocks, etc.: -ve/+ve
      LOAN DETAILS
      • Absolute values
        • Total Loan Requested: Depends
        • Term of the loan: -ve/+ve
        • Make/Model/Model Year of the Car: Depends on market demand for the Make/Model
        • Past relationship with the Lender: -ve
        • New/Used Car: New = -ve
      DEMOGRAPHIC DETAILS
      • Absolute values
        • Home Owner/Renter, #Dependents, Gender, Marital Status, Age, Occupation, Education, Profession: Depends on the variable
      MACROECONOMIC DATA
      • Absolute values
        • GDP, Household Savings Ratio, Fuel Prices, Unemployment Rate, Interest Rates, etc.: Depends on the variable
      • Deviation
        • Trend, Shocks, etc.: Depends on the variable
      GEO DATA
      • Absolute values
        • City, State, Region Cluster, Local Competition Data, Dealership level factors, etc.: Depends on the variable
      TRANSACTIONS DATA
      • Absolute values
        • Monthly Payments, #Payments made, #Nonrepayments, Time to CO, Amount of Nonrepayment, Recovery Rate, etc.: Depends on the variable
      • Deviation
        • Trend, Shocks, etc.: Depends on the variable
  12. DATA PREPARATION: CAPPING & MISSING VALUE TREATMENT
      Capping treatment is necessary to remove the effect of extreme or nonsensical values that are very different from the rest of the population.
      [Sample data extract: 10 records with columns No., Pyoffflg, Prin0105, Loanamt, Term, Fixed, Agnsttr, Bbctrad, Nummortt, Rvoptbal, Numminq, Numminq3; callouts mark missing observations (shown as ".") and unrealistic values, such as a Loanamt of 9999999 and a Term of 900]
      Missing treatment is the imputation of missing values for certain variables, and is mandatory. If left unattended, the entire record is excluded from modeling.
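A minimal Python sketch of the two treatments follows. The slide does not say which cut-offs or imputation rule were used, so the percentile caps and the median fill below are assumptions chosen for illustration, as are the function name `cap_and_impute` and the sample list (loan amounts borrowed from the slide's extract, with one value replaced by a missing entry).

```python
import math
import statistics


def cap_and_impute(values, lower_pct=1, upper_pct=99):
    """Cap extremes at percentile bounds, then fill missing (None) with the median."""
    present = sorted(v for v in values if v is not None)
    if not present:
        raise ValueError("cannot treat a column with no observed values")

    def pct(p):
        # Nearest-rank percentile on the observed values.
        rank = max(1, math.ceil(len(present) * p / 100))
        return present[rank - 1]

    lo, hi = pct(lower_pct), pct(upper_pct)
    capped = [None if v is None else min(max(v, lo), hi) for v in values]
    # Median of the capped values, so the imputed value is not pulled by outliers.
    median = statistics.median(v for v in capped if v is not None)
    return [median if v is None else v for v in capped]


# Hypothetical column: one unrealistic amount and one missing observation.
loan_amounts = [19900, 22100, 42000, 21760, 18000, 15500, 25150, 17800, 9999999, None]
treated = cap_and_impute(loan_amounts, lower_pct=10, upper_pct=80)
```

Capping before imputing keeps the unrealistic value from influencing the fill value; the wide 10/80 bounds are used only because the sample has ten rows.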
  14. DIMENSIONALITY REDUCTION: BIVARIATE ANALYSIS
      Bivariate analysis explores the nature and degree of the relationship between the independent and dependent variables.
      • Rank Plots: check whether a predictor variable correlates with the Target variable. Steps:
        • Sort the population by predictor variable values
        • Split into groups with an equal number of observations, generally ten groups or deciles
        • Get the average of the Target variable in each group
        • Check if there is a trend in the average value of the Target variable from the top group to the bottom
      [Figure: two rank plots of Avg Target against Predictor Deciles; one panel is labeled "Dummy = (predictor value<=2)", the other "No relationship"]
      It not only helps in finding related predictors and predictor transformations, it also helps in dimensionality reduction.
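The rank-plot steps above can be sketched as follows. This is an illustrative Python version, not the deck's procedure; the helper names `rank_plot_table` and `is_monotonic` and the toy credit-score data are assumptions.

```python
def rank_plot_table(predictor, target, n_groups=10):
    """Average target per equal-sized group after sorting by the predictor."""
    if len(predictor) != len(target):
        raise ValueError("predictor and target must have the same length")
    if not 0 < n_groups <= len(predictor):
        raise ValueError("n_groups must be between 1 and the number of observations")
    pairs = sorted(zip(predictor, target), key=lambda p: p[0])
    n = len(pairs)
    averages = []
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        averages.append(sum(t for _, t in group) / len(group))
    return averages


def is_monotonic(averages):
    """True if group averages move in one direction only and are not flat."""
    steps = list(zip(averages, averages[1:]))
    rising = all(a <= b for a, b in steps)
    falling = all(a >= b for a, b in steps)
    return (rising or falling) and averages[0] != averages[-1]


# Hypothetical data: non-repayment concentrated among the lowest scores.
scores = list(range(500, 700, 10))
np_flag = [1] * 8 + [0] * 12
table = rank_plot_table(scores, np_flag)
```

A flat table (the "No relationship" panel) fails `is_monotonic`, which is one simple way to screen out predictors before the multivariate step.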
  15. DIMENSIONALITY REDUCTION: MULTIVARIATE ANALYSIS
      The two metrics predominantly used are the Variance Inflation Factor (VIF) and the Condition Index (CI).
      • Variance Inflation Factor (VIF): obtained by regressing each independent variable, say X, on the remaining independent variables (say x1 and x2) and checking how much of X is explained by these variables. Cut-offs used vary from 2 to 10.
      • Condition Index (CI): the square root of the ratio of the highest eigenvalue (λmax) to the individual eigenvalue (λ). Cut-offs used vary from 13 to 30. Very similar to Principal Component Analysis (PCA).
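Written out, the two definitions above take the standard textbook form below. The symbols are notation introduced here for illustration: \(R_j^2\) is the R-square from regressing predictor \(X_j\) on the other predictors, and \(\lambda_k\) is the k-th eigenvalue reported in the collinearity diagnostics.

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2},
\qquad
\mathrm{CI}_k = \sqrt{\frac{\lambda_{\max}}{\lambda_k}}
```

As a check against the sample output on the next slide, the eigenvalues 8.3631 and 1.01345 give \(\sqrt{8.3631 / 1.01345} \approx 2.87\), which matches the reported condition index of 2.87264.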
  16. GENERALIZED LINEAR MODELS: SAMPLE VIF/CI OUTPUT
      The REG Procedure
      Model: MODEL1
      Dependent Variable: NP_Flag
      Number of Observations Read  40162
      Number of Observations Used  40162

      Analysis of Variance
      Source           DF     Sum of Squares  Mean Square  F Value  Pr > F
      Model            12     610.91533       50.90961     219.02   <.0001
      Error            40149  9332.36401      0.23244
      Corrected Total  40161  9943.27934

      Root MSE        0.48212   R-Square  0.0614
      Dependent Mean  0.5492    Adj R-Sq  0.0612
      Coeff Var       87.78642

      Parameter Estimates
      Variable              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
      Intercept             1   1.24953             0.20693         6.04     <.0001    0
      Credit_Score          1   -0.000216           0.00028377      -0.76    0.4465    1.0205
      %Down_Pymt_to_Loan    1   -0.1166             0.0117          -9.96    <.0001    1.09417
      %Mnthly_Pymt_to_Loan  1   0.01966             0.00517         -3.8     0.0001    1.17587

      Collinearity Diagnostics (Proportion of Variation in the last four columns)
      Number  Eigenvalue  Condition Index  Intercept   Credit_Score  %Down_Pymt_to_Loan  %Mnthly_Pymt_to_Loan
      1       8.3631      1                0.00000188  0.00000202    0.00002708          0.00057815
      2       1.01345     2.87264          8.65E-09    8.73E-09      1.04E-07            5.68E-06
      3       0.96895     2.93787          2.42E-11    5.60E-14      1.68E-09            0.0000019
      8       0.22138     6.14626          0.00000754  0.00000817    0.00009252          0.00396
      9       0.20341     6.41212          0.00001611  0.00001745    0.00020511          0.01911
      10      0.05087     12.82208         0.00000322  0.00000279    0.00011988          0.26143
      11      0.02578     18.01153         0.00082432  0.00088072    0.00992             0.68574
      12      0.00137     78.10783         0.01375     0.01859       0.96941             0.02085
      13      0.00007104  343.097          0.98539     0.98048       0.02008             0.00000173
  18. MODELING DETAILS
      1. What are the chances of Non-repayment? - Probability of Non-repayment (PD)
      2. If it happens, how much money will go bad? - Predict the $ amount at risk of Non-repayment (EAD)
      3. How much will I ultimately recover if I repossess and sell off the vehicle? - Estimate the % of Amount at risk that cannot be recovered (LGD)
  19. MODELING DETAILS
      1. Probability of Non-repayment (PD): Logistic Model
      2. Predict the $ amount at risk of Non-repayment (EAD): OLS Model
      3. Predict the % of Amount at risk that cannot be recovered (LGD): Average by Risk Deciles
  20. SAMPLING
      Full applications data from the analysis time window is split into:
      • Modeling Sample (50%): model development
      • Testing Sample (50%): model validation
      The model is also validated on data from another time window.
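A 50/50 development/validation split can be sketched as below. The slide does not specify how the split was drawn, so the seeded random shuffle, the `split_sample` helper and the ten application IDs are assumptions for illustration.

```python
import random


def split_sample(application_ids, dev_share=0.5, seed=42):
    """Shuffle IDs reproducibly and split into development and validation samples."""
    ids = list(application_ids)
    random.Random(seed).shuffle(ids)  # local generator, so global state is untouched
    cut = int(len(ids) * dev_share)
    return ids[:cut], ids[cut:]


# Hypothetical application IDs 1..10.
dev_ids, val_ids = split_sample(range(1, 11))
```

Fixing the seed keeps the two samples stable across reruns, which matters when driver stability is compared between development and validation models. The out-of-time check mentioned on the slide would use a separate extract rather than this split.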
  22. LOGISTIC REGRESSION
      What is a Logistic Model?
      • Predicts the log odds (event/non-event)
      • The predictive model is a mathematical relationship between the predictors and the Target:
        Log (odds) = α + β1X1 + β2X2
      SAS procedure: Proc Logistic (with various link functions)
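To turn the log odds into a probability of non-repayment, the logistic function is applied to the linear score. The sketch below assumes the model has already been fitted elsewhere; the `predict_pd` helper, the coefficient values and the feature names are hypothetical and are not taken from the deck's output.

```python
import math


def predict_pd(intercept, coefficients, features):
    """Return P(non-repayment) from a logistic model's intercept and coefficients."""
    log_odds = intercept + sum(coefficients[name] * features[name] for name in coefficients)
    # Inverse of the logit: probability = 1 / (1 + exp(-log_odds)).
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical fitted parameters, for illustration only.
intercept = 2.0
coefficients = {"credit_score": -0.005, "pymt_to_income": 1.5}

pd_strong = predict_pd(intercept, coefficients, {"credit_score": 750, "pymt_to_income": 0.10})
pd_weak = predict_pd(intercept, coefficients, {"credit_score": 500, "pymt_to_income": 0.70})
```

With a negative coefficient on credit score, the higher-score applicant receives the lower predicted probability, which is the sign check the deck asks for when reviewing a model.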
  23. HOW TO FIND IF A METHOD WORKS?
      For Logistic Models, the following metrics are used as performance diagnostics:
      • Concordance/Discordance: overall indicator of the model's prediction accuracy
        • Pair every "bad" observation with every "good" observation
        • Check the % of pairs where the "bad" guy is given a higher probability than the "good" guy
      • Rank Order: a similar test to the above, but in a more structured format. Steps:
        • Sorting: sort the population by predicted probability
        • Deciling: bucket them into ten groups, each having 10% of the population in the sorted order
        • Check the % of Non-repayment guys in each decile
        • Capturing: ideally the % of bad guys should be highest in the top deciles and lowest in the bottom deciles. The top deciles should capture most of the Non-repayment guys.
      • Gains Chart: graphical representation of capturing by the model and its performance against random bucketing.
      • Akaike Information Criterion (AIC): helps in selecting the most "parsimonious" regression model, i.e., maximum information capture with the least number of predictors.
      These come in addition to the usual checks on signs, statistical significance, and whether the model also holds in the validation samples.
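The concordance check can be sketched directly from its definition. This Python version is an illustration only: it compares raw probabilities over every bad/good pair, so its tie handling may differ from what a statistical package reports, and the function name and sample data are assumptions.

```python
def concordance(probabilities, outcomes):
    """Percent concordant/discordant/tied over all (bad, good) pairs, plus c."""
    bad = [p for p, y in zip(probabilities, outcomes) if y == 1]
    good = [p for p, y in zip(probabilities, outcomes) if y == 0]
    pairs = len(bad) * len(good)
    if pairs == 0:
        raise ValueError("need at least one bad and one good observation")
    concordant = sum(1 for b in bad for g in good if b > g)
    discordant = sum(1 for b in bad for g in good if b < g)
    tied = pairs - concordant - discordant
    return {
        "pct_concordant": 100.0 * concordant / pairs,
        "pct_discordant": 100.0 * discordant / pairs,
        "pct_tied": 100.0 * tied / pairs,
        "c": (concordant + 0.5 * tied) / pairs,
    }


# Hypothetical scores: outcome 1 = non-repayment ("bad"), 0 = repaid ("good").
stats = concordance([0.9, 0.8, 0.3, 0.4, 0.3], [1, 0, 1, 0, 0])
```

The double loop is quadratic in the sample size, which is fine for a sketch but would need a rank-based method on a portfolio with millions of pairs.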
  24. SAMPLE MODEL OUTPUT
      Type 3 Analysis of Effects
      Effect                DF  Wald Chi-Square  Pr > ChiSq
      APPLICATION_PRIM_CB_  2   14.5230          0.0007
      %Down_Pymt_to_Loan    2   126.6605         <.0001
      %Mnthly_Pymt_to_Loan  2   83.5880          <.0001

      Analysis of Maximum Likelihood Estimates
      Parameter             DF  Development Model Estimate  Validation Model Estimate  Standard Error  Wald Chi-Square  Pr > ChiSq
      Intercept             1   1.1321                      -0.4085                    0.8909          0.2102           0.6466
      APPLICATION_PRIM_CB_  1   -0.00349                    -0.00220                   0.00122         3.2494           0.0715
      %Down_Pymt_to_Loan    1   -0.3934                     -0.2839                    0.0485          34.2834          <.0001
      %Mnthly_Pymt_to_Loan  1   0.1206                      -0.0900                    0.0221          16.5920          <.0001

      Odds Ratio Estimates
      Effect                outcome  Point Estimate  95% Wald Confidence Limits
      APPLICATION_PRIM_CB_  1        0.998           0.995  1.000
      %Down_Pymt_to_Loan    1        0.753           0.685  0.828
      %Mnthly_Pymt_to_Loan  1        0.914           0.875  0.954

      Percent Concordant  65.9        Somers' D  0.338
      Percent Discordant  32.1        Gamma      0.345
      Percent Tied        2.0         Tau-a      0.074
      Pairs               1806529536  c          0.669

      The higher the percent concordant, the better the model.
  25. SAMPLE GAINS CHART
      [Figure: gains chart. x-axis: Population (%); y-axis: Responders captured; curves: Model capturing, Random capturing]
      The higher the capturing in the initial deciles, the better the model performance.
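The numbers behind a gains chart can be computed as cumulative capture by score-ranked group. The sketch below is illustrative: `cumulative_gains`, the ten scores and the outcomes are assumptions, and five groups are used only because the toy sample is small.

```python
def cumulative_gains(probabilities, outcomes, n_groups=10):
    """Cumulative % of non-repayers captured after each group, highest scores first."""
    total_bad = sum(outcomes)
    if total_bad == 0:
        raise ValueError("no non-repayment cases to capture")
    ranked = sorted(zip(probabilities, outcomes), key=lambda p: p[0], reverse=True)
    n = len(ranked)
    gains = []
    for g in range(1, n_groups + 1):
        captured = sum(y for _, y in ranked[:g * n // n_groups])
        gains.append(100.0 * captured / total_bad)
    return gains


# Hypothetical scores (already in descending order) and outcomes (1 = non-repayment).
probs = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
flags = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
gains = cumulative_gains(probs, flags, n_groups=5)
```

Random bucketing would capture about 20% of non-repayers per group of this size, so the gap between `gains` and that diagonal is what the chart visualizes.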
  27. OLS MODELS
      What is a Linear Model?
      • Predicts the value of the Target variable
      • The predictive model is a mathematical relationship between the predictors and the Target:
        y = α + β1X1 + β2X2
      Note: these models are developed only on the "bad" population, since including the "good" population will skew the model.
      SAS procedure: Proc Reg
  28. HOW TO FIND IF A METHOD WORKS?
      For Linear Models, the following metrics are used as performance diagnostics:
      • R-square: tells how much of the variance in the "Target" variable is captured by the model.
      • Error rate (%): tells what the error is relative to the actual values of the Target variable.
        Error rate (%) = average of square(actual - predicted) / average of actuals
      • Rank Order: checks if the predicted values correlate with the actual values. Steps:
        • Sorting: sort the population by predicted values
        • Deciling: bucket them into ten groups, each having 10% of the population in the sorted order
        • Check the average value of the prediction and the average value of the actuals in each decile
        • Check if both averages are gradually decreasing from the top group to the bottom
      • Akaike Information Criterion (AIC): helps in selecting the most "parsimonious" regression model, i.e., maximum information capture with the least number of predictors.
      These come in addition to the usual checks on signs, statistical significance, and whether the model also holds in the validation samples.
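The first two diagnostics can be computed from paired actual and predicted values. In this sketch the error rate follows the slide's wording literally (mean squared error divided by the mean of the actuals); the function name and the three sample values are assumptions for illustration.

```python
def linear_fit_metrics(actual, predicted):
    """R-square and the slide's error rate for paired actual/predicted values."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    mean_actual = sum(actual) / n
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0 or mean_actual == 0:
        raise ValueError("actuals must vary and have a non-zero mean")
    r_square = 1.0 - ss_res / ss_tot
    # Error rate exactly as worded on the slide: mean squared error over mean actual.
    error_rate = (ss_res / n) / mean_actual
    return r_square, error_rate


# Hypothetical $ amounts at risk for three defaulted loans.
r_square, error_rate = linear_fit_metrics([100, 200, 300], [110, 190, 300])
```

Computing both on the validation sample as well as the development sample gives a quick read on whether the fit holds up out of sample.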
  29. SAMPLE MODEL OUTPUT
      The REG Procedure
      Model: MODEL1
      Dependent Variable: NP_Flag
      Number of Observations Read  40162
      Number of Observations Used  40162

      Analysis of Variance
      Source           DF     Sum of Squares  Mean Square  F Value  Pr > F
      Model            12     610.91533       50.90961     219.02   <.0001
      Error            40149  9332.36401      0.23244
      Corrected Total  40161  9943.27934

      Root MSE        0.48212   R-Square  0.0614
      Dependent Mean  0.5492    Adj R-Sq  0.0612
      Coeff Var       87.78642

      Parameter Estimates
      Variable              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Variance Inflation
      Intercept             1   1.24953             0.20693         6.04     <.0001    0
      Credit_Score          1   -0.000216           0.00028377      -0.76    0.4465    1.0205
      %Down_Pymt_to_Loan    1   -0.1166             0.0117          -9.96    <.0001    1.09417
      %Mnthly_Pymt_to_Loan  1   0.01966             0.00517         -3.8     0.0001    1.17587
  30. GENERALIZED LINEAR MODELS: SAMPLE RANK ORDERING FOR LINEAR MODELS
      [Figure: "Rankordering" chart. x-axis: Decile (1 to 10); y-axis: Avg Value in a Decile; series: Avg Predicted in this Decile, Avg Actual in this Decile]
  32. LOSS POST RECOVERY - NON RECOVERY RATE (%): SAMPLE CALCULATION BY DECILES
      Decile of $ Model  Avg Non Recovery Rate (%)
      1                  50%
      2                  47%
      3                  40%
      4                  38%
      5                  37%
      6                  27%
      7                  15%
      8                  12%
      9                  10%
      10                 5%
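A table like the one above is an average of observed non-recovery rates within each decile. The sketch below shows that aggregation in Python; the `avg_non_recovery_by_decile` helper and the four sample records are hypothetical and chosen so that the averages echo three rows of the slide.

```python
from collections import defaultdict


def avg_non_recovery_by_decile(records):
    """Average non-recovery rate per decile from (decile, rate) records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for decile, rate in records:
        sums[decile] += rate
        counts[decile] += 1
    return {d: sums[d] / counts[d] for d in sorted(sums)}


# Hypothetical defaulted loans: (decile of the $ model, share of the amount not recovered).
records = [(1, 0.55), (1, 0.45), (2, 0.47), (10, 0.05)]
lgd_table = avg_non_recovery_by_decile(records)
```

A new application would then take the average rate of the decile its predicted $ amount falls into as its LGD estimate.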
  36. RECOMMENDATIONS
      Based on simulations and business needs, score buckets are created with ranges for High/Mid/Low Risk:
      • HIGH RISK: decline, or price at a premium to recover the maximum amount before Non-repayment
      • MID RISK: approve but charge high interest at the beginning, which can then be negotiated down to a floor value
      • LOW RISK: approve, with proactive interest rate reduction and cross-sell efforts aimed at making them come back
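The bucketing can be sketched as a simple threshold rule on the predicted probability of non-repayment. The deck does not give the score ranges, so the 5% and 20% cut-offs below are purely hypothetical placeholders, as is the `recommend` helper.

```python
def recommend(pd_score, mid_cutoff=0.05, high_cutoff=0.20):
    """Map a predicted probability of non-repayment to a risk bucket and action."""
    if not 0.0 <= pd_score <= 1.0:
        raise ValueError("pd_score must be a probability between 0 and 1")
    if pd_score >= high_cutoff:
        return "HIGH RISK", "Decline or price at a premium"
    if pd_score >= mid_cutoff:
        return "MID RISK", "Approve with a high initial rate, negotiable down to a floor"
    return "LOW RISK", "Approve; proactive rate reduction and cross-sell"


bucket, action = recommend(0.12)
```

In practice the cut-offs would come out of the simulations the slide mentions, trading approval volume against expected loss.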
  37. DEPLOYMENT AT A 30K FEET LEVEL
      Typical steps:
      • At the dealership level: negative list verification from Driving License details
      • The finance guy at the dealer then inputs all PII, including Social Security details, into the "Approval" system; the engine runs the model with the Bureau data and other model details
      • The system recommends a decision (yes/no) and a guidance price, which can then be negotiated with the Credit executive based on the scenarios/sales/risk guidance he has
  38. OTHER APPLICATIONS OF THE MODEL
      Some other areas within the institution where the model outputs are leveraged:
      • Portfolio P&L estimation
        Net Income from this business = Sum (all Monthly Payments) - (Probability of Non-repayment * Estimated $ of Non-repayment * Loss Post Recovery)
        In the accounting world:
        1. Monthly Payments figures are "discounted" for inflation over the loan time window
        2. The net income is then compared against the returns the firm would have earned by investing the same amount at US Government Treasury rates, to justify running this business
      • Regulatory risk reporting - BASEL norms
      • Customer bucketing for Upselling/Cross selling/Retention programs
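The P&L formula can be sketched as a per-loan calculation summed over the portfolio. Applying the formula loan by loan is an assumption of this example, as are the `portfolio_net_income` helper, the dictionary keys and the sample loan; discounting and the Treasury comparison mentioned on the slide are left out.

```python
def portfolio_net_income(loans):
    """Sum of monthly payments minus expected loss (PD * EAD * LGD), over all loans."""
    total = 0.0
    for loan in loans:
        payments = sum(loan["monthly_payments"])
        expected_loss = loan["pd"] * loan["ead"] * loan["lgd"]
        total += payments - expected_loss
    return total


# Hypothetical single-loan portfolio: 12 payments of $500, 10% PD, $4,000 EAD, 50% LGD.
loans = [{"monthly_payments": [500] * 12, "pd": 0.10, "ead": 4000, "lgd": 0.50}]
net_income = portfolio_net_income(loans)
```

Here the expected loss is 0.10 * 4000 * 0.50 = 200, so the undiscounted net income is 6000 - 200 = 5800.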
  39. SIMILAR FRAMEWORK IN OTHER INDUSTRIES
      A similar framework is used in other industries for solving various business problems:
      • Marketing Campaigns: e.g., find out which customers are more likely to respond to campaigns and, if they do, how much $ they would spend with us
        • How many will use Friend finder on Facebook and, if they do, how many invites will they send?
        • How many will see the promoted news feed? How many will re-share it?
      • Loyalty Models (ecommerce): e.g., will a customer get engaged (repeat purchases) and, if he does, how much $ will he spend with us?
      • Attrition Models (Telecom): e.g., are we going to lose a customer and, if yes, how much revenue impact will it have?
  40. APPENDIX
  41. GOOD INFO ON LINEAR & LOGISTIC REGRESSION AT...
      Linear Regression: http://faculty.chass.ncsu.edu/garson/PA765/regress.htm
      Logistic Regression: http://faculty.chass.ncsu.edu/garson/PA765/logistic.htm