This project aims to predict defaulters of credit card payments. R is used for exploratory data analysis, and R together with Azure ML is used for model building.
Default payment prediction system
1. Default Payment Prediction System
Data Analysis and Predictive Analysis – R Programming and Azure ML
ASHISH ARORA
2. Introduction and Problem
• Banks play a significant role in providing
financial services that help people and
businesses achieve their goals and
reach their potential.
• To protect its integrity, a bank must avoid
extending credit to customers who are likely to default
and cause losses to the financial institution.
3. Purpose and Process
• To build a predictive model that can be used to
help the Banks use their data efficiently to
make better decisions.
• A predictive analytics application allows the
banks and other financial institutions to
identify the risks and address them in real time
to reach better outcomes.
• A bank must be able to analyze the available data
about a customer before deciding whether to
issue a credit card.
• The model developed will use all available
factors and data to predict, with reasonable
accuracy, whether the customer will fail or
succeed in making the next payment. This
benefits the bank before it makes any
decision about that customer. The target is
to minimize the risk of loan losses.
4. Data Set
• https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients
• 30000 rows
• Features in dataset = 25
• This dataset contains
information on default
payments, demographic
factors, credit data, history of
payment, and bill statements
of credit card clients in Taiwan
from April 2005 to September
2005.
• There are no missing data.
5. R Code – Description
And Results
• # Read the .csv file into the R environment
• creditcarddata <- read.csv("default of credit card clients.csv")
• dim(creditcarddata)
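Once the file is loaded, a few quick checks can confirm the figures quoted on the Data Set slide (row count, feature count, no missing data). The following is a minimal sketch, not the author's code: it uses a small made-up data frame in place of the real file so that it runs on its own, and the values in it are illustrative only.

```r
# Toy stand-in for creditcarddata (values are made up for illustration)
creditcarddata <- data.frame(
  ID        = 1:4,
  LIMIT_BAL = c(20000, 120000, 90000, 50000),
  SEX       = c(2, 2, 1, 1),
  AGE       = c(24, 26, 34, 37)
)

n_rows    <- nrow(creditcarddata)          # number of clients
n_cols    <- ncol(creditcarddata)          # number of features
n_missing <- sum(is.na(creditcarddata))    # total count of NA cells

# Per-column NA counts help locate gaps if n_missing is not zero
na_by_col <- colSums(is.na(creditcarddata))

print(c(rows = n_rows, cols = n_cols, missing = n_missing))
```

On the real file, the same calls would be expected to report the 30000 rows, 25 features and zero missing values described earlier.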
6. Data Set Summary
• There are two key variable categories in the
dataset.
• Nominal variables include sex, education,
marriage, repayment statuses (PAY_X), etc.
• Numeric variables include age, amount of
given credit (LIMIT_BAL), amount of bill
statements (BILL_AMT), and amount of
previous payments (PAY_AMT).
• The class variable (y) indicates whether the
customer defaulted on the payment due the
next month. If yes, it is labeled 1;
otherwise, it is set to 0.
7. Structure of Data
Before Adding new
variables and
Tidying the Data
• This is the structure of Data
before reshaping and
cleaning step.
• New Variables can be created
to give more possibility of
predicting defaulters.
• The SEX, EDUCATION and
MARRIAGE variables can be
converted from integer to
categorical data.
8. Structure of Data after
adding new variables
• 4 new columns are added to
make data set more
meaningful.
• The new columns being added
are work_status,
education_cat, MARRIAGE_cat
and SEX_cat
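The integer-to-categorical conversion can be sketched with `factor()`. This is an illustration under stated assumptions rather than the code behind the slides: the toy rows are made up, the label text is my own choice, and the integer codes are taken from the UCI dataset documentation. How work_status is derived is not shown in the slides, so it is left out here.

```r
# Toy rows; integer codes follow the UCI dataset documentation:
#   SEX:       1 = male, 2 = female
#   EDUCATION: 1 = graduate school, 2 = university, 3 = high school, 4 = others
#   MARRIAGE:  1 = married, 2 = single, 3 = others
creditcarddata <- data.frame(
  SEX       = c(1, 2, 2, 1),
  EDUCATION = c(2, 1, 3, 4),
  MARRIAGE  = c(1, 2, 2, 3)
)

# Add labelled factor columns next to the original integer columns
creditcarddata$SEX_cat <- factor(
  creditcarddata$SEX, levels = c(1, 2), labels = c("Male", "Female")
)
creditcarddata$education_cat <- factor(
  creditcarddata$EDUCATION, levels = 1:4,
  labels = c("Graduate school", "University", "High school", "Others")
)
creditcarddata$MARRIAGE_cat <- factor(
  creditcarddata$MARRIAGE, levels = 1:3,
  labels = c("Married", "Single", "Others")
)

str(creditcarddata)
```

Any code outside the listed levels becomes NA, so undocumented codes in the raw file would need to be mapped (for example to "Others") before this step.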
9. Reshaping the Data
• Reshaping the data by converting
quantitative variables to new factor
variables
• Factors are categorical variables that are
super useful in summary statistics, plots, and
regressions. They basically act like dummy
variables that R codes for you.
• Removing Variables which are not useful for
analysis.
• Variables removed from dataset are
PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6.
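Both reshaping steps can be sketched in a few lines of base R. This is a sketch under assumptions: the toy rows are made up, and the `age_group` column name and its break points are hypothetical, since the slides do not state which quantitative variables were binned or how.

```r
# Toy rows (made-up values) with a subset of the original columns
creditcarddata <- data.frame(
  LIMIT_BAL = c(20000, 120000, 90000),
  AGE       = c(24, 35, 52),
  PAY_0 = c(2, -1, 0), PAY_2 = c(2, 2, 0), PAY_3 = c(-1, 0, 0),
  PAY_4 = c(-1, 0, 0), PAY_5 = c(-2, 0, 0), PAY_6 = c(-2, 2, 0),
  BILL_AMT1 = c(3913, 2682, 29239)
)

# Quantitative -> factor: bin AGE into groups (breaks are hypothetical)
creditcarddata$age_group <- cut(
  creditcarddata$AGE,
  breaks = c(20, 30, 40, 50, 60, 80),
  labels = c("21-30", "31-40", "41-50", "51-60", "61-80")
)

# Remove the variables listed on the slide
drop_cols <- c("PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")
creditcarddata <- creditcarddata[, !(names(creditcarddata) %in% drop_cols), drop = FALSE]

names(creditcarddata)
```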
11. Exploring Data Via
Basic Visualization
• There are more females than males in the
dataset.
• There are clients who finished
university-level education.
• There are more single clients than married ones,
but the numbers are quite close.
• More clients are employed than not.
13. Determining
Balance Limit
Variability By
Factors of Gender,
Education and
Work State
• The box plots show that gender has no visible
effect on the balance limits set by the bank.
• Education level and work status are the most
important factors that banks appear to consider
when determining balance limits.
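The comparison behind these box plots can be sketched as follows. The data frame is made up for illustration and does not reproduce the slide's results; it only shows how group medians and box-plot statistics of LIMIT_BAL might be obtained per factor.

```r
# Toy data (made-up values) with two of the factor columns
toy <- data.frame(
  LIMIT_BAL = c(50000, 80000, 200000, 60000, 90000, 210000),
  SEX_cat = factor(c("Male", "Male", "Male", "Female", "Female", "Female")),
  education_cat = factor(c("High school", "University", "Graduate school",
                           "High school", "University", "Graduate school"))
)

# Median limit per group: similar medians suggest the factor matters little
median_by_sex <- tapply(toy$LIMIT_BAL, toy$SEX_cat, median)
median_by_edu <- tapply(toy$LIMIT_BAL, toy$education_cat, median)

# Box-plot statistics per education level; row 3 of $stats holds the medians.
# Drop plot = FALSE to draw the chart instead of only computing the numbers.
bp <- boxplot(LIMIT_BAL ~ education_cat, data = toy, plot = FALSE)

print(median_by_sex)
print(median_by_edu)
```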
14. Relationship Between Marital Status &
Balance Limits Categorized By Gender
• This graph shows that for
females, balance limits are
almost the same whether
married or single. For males,
however, limits differ
considerably by marital
status, possibly because of
additional expenditures that
lead to higher balance
limits.
15. Relationship between Limit
Balance & Default Payment
• Balance limits and the count of
defaulted clients are almost
the same for the university and
graduate levels. Additionally,
the ratio of defaulted clients at
the high school level seems almost
the same as at the university and
graduate levels.
16. Balance Limits By Age
Groups & Education
• This box plot shows that the
balance limit for individuals in
higher age groups increases
with their education
level.
17. Correlations Between Limit Balance,
Bill Amounts & Payments
• This correlation plot shows
that there is a low correlation
between the limit balances
and the payments and bill
amounts. However, the bill
amounts are highly correlated
with each other, as expected,
since the bills reflect
cumulative amounts.
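A correlation matrix like the one plotted here can be computed with `cor()`. The sketch below uses a made-up data frame with only a few of the columns, so its numbers say nothing about the real dataset; it only illustrates the call and how to read off one pair.

```r
# Toy data (made-up values) with a limit, two bill amounts and one payment
toy <- data.frame(
  LIMIT_BAL = c(20000, 120000, 90000, 50000, 500000, 100000),
  BILL_AMT1 = c(3900, 2700, 29000, 47000, 8600, 64000),
  BILL_AMT2 = c(3100, 1700, 14000, 48000, 5700, 57000),
  PAY_AMT1  = c(0, 0, 1500, 2000, 2000, 2500)
)

# Pearson correlation between every pair of columns
corr <- cor(toy)

# Correlation between two consecutive bill statements
bill_corr <- corr["BILL_AMT1", "BILL_AMT2"]

print(round(corr, 2))
```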
18. Is there any
variability in
defaulting payment
next month based
on gender,
education and
marital status?
• It seems that more males default on payment,
and in terms of education, more clients with high
school as their last degree default on payment.
• Marital status of the client doesn't show any variability.
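Default rates per group can be compared with a cross-tabulation. This is a sketch on made-up rows; the column name `default_next_month` is a hypothetical stand-in for the class variable (y), and the resulting rates are not the slide's findings.

```r
# Toy data (made-up values); 1 = defaulted next month, 0 = did not
toy <- data.frame(
  SEX_cat = factor(c("Male", "Male", "Male", "Male",
                     "Female", "Female", "Female", "Female")),
  default_next_month = c(1, 1, 0, 0, 1, 0, 0, 0)
)

# Counts of default / no default per gender
counts <- table(toy$SEX_cat, toy$default_next_month)

# Row-wise proportions: the share of each gender that defaulted
rates <- prop.table(counts, margin = 1)
default_rate <- rates[, "1"]

print(default_rate)
```

The same pattern applies to education_cat and MARRIAGE_cat by swapping the grouping column.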
19. Model Building
• This section is to start building
the model for predicting the
default payment outcome.
• Before building the model, the
dataset was divided into training
and test data sets.
• Train Data Set = 70%
• Test Data Set = 30%
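A 70/30 split can be sketched in base R as below. This assumes a simple random split; the slides do not say whether the original split was random or stratified, and the seed and the toy data frame are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Toy stand-in for the cleaned dataset (made-up values)
creditcarddata <- data.frame(
  ID        = 1:100,
  LIMIT_BAL = seq(10000, by = 5000, length.out = 100)
)

n <- nrow(creditcarddata)

# Randomly pick 70% of the row indices for training
train_idx <- sample(seq_len(n), size = round(0.7 * n))

train_data <- creditcarddata[train_idx, ]   # 70% of rows
test_data  <- creditcarddata[-train_idx, ]  # remaining 30%

print(c(train = nrow(train_data), test = nrow(test_data)))
```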
21. Model Building Using Azure ML
• The Model is trained using Two-Class Decision Forest.
22. The classification matrix or
the confusion matrix
• The matrix sorts each prediction into one of four outcomes: true positive, false positive, false negative, or true negative.
• True Positive = The actual value is 1 (the client defaulted) and the predicted value is also 1.
• False Positive = The predicted value is 1 but the actual value is 0: we predicted the client would default,
and they did not.
• False Negative = We predicted the client would not default, and they defaulted.
• True Negative = We predicted the client would not default, and they did not default.
• Accuracy = What percentage of the total test data set is predicted correctly.
• Accuracy = (TP + TN) / Total = (662 + 6734) / (1329 + 275 + 662 + 6734) = 0.82
• Precision = When you predicted default, how likely are you to be correct?
• Precision = TP / (TP + FP) = 662 / (662 + 275) = 0.707
• Recall = Out of all clients who actually defaulted, what fraction did you correctly predict as
defaulters?
• Recall = TP / (TP + FN) = 662 / (662 + 1329) = 0.332
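The three metrics can be recomputed from the four counts quoted on this slide. The sketch below takes TP = 662, FP = 275, FN = 1329 and TN = 6734 from the formulas above; the matrix layout (actual classes in rows, predicted classes in columns) is my own choice and may differ from the Azure ML display.

```r
# Counts taken from the slide's formulas
TP <- 662    # predicted default, actually defaulted
FP <- 275    # predicted default, did not default
FN <- 1329   # predicted no default, actually defaulted
TN <- 6734   # predicted no default, did not default

# Confusion matrix: rows = actual class, columns = predicted class
conf_mat <- matrix(
  c(TP, FN,
    FP, TN),
  nrow = 2, byrow = TRUE,
  dimnames = list(actual = c("1", "0"), predicted = c("1", "0"))
)

accuracy  <- (TP + TN) / sum(conf_mat)  # share of all predictions that are correct
precision <- TP / (TP + FP)             # share of predicted defaulters who defaulted
recall    <- TP / (TP + FN)             # share of actual defaulters that were caught

print(conf_mat)
print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

The low recall next to the higher precision indicates that the model misses most actual defaulters while being right fairly often when it does flag one.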
23. Conclusion
• This project predicts defaulters among
credit card customers of a bank.
• R programming is used for exploratory data
analysis and visualization.
• R and Azure ML are used for model building with
Logistic Regression and the Two-Class Decision
Forest algorithm.