Bank market classification
1. Classification of Bank Marketing Dataset
using Decision Tree Induction
Sunil Kumar P (A13020)
Maruthi Nataraj K (A13009)
Praxis Business School, Kolkata
31-Oct-2013
3. Introduction
Problem Statement
Marketing campaign strategy of XYZ International Bank
Increase in the number of marketing campaigns
Economic pressure and Competition
Product promotion
- Mass campaigns
- Directed marketing
Reduction in cost and time
Improvement in efficiency
- Fewer contacts, more successes
4. Objective
Classify the potential customers
- Likely to subscribe to the term deposit offered to them
Decision Tree Algorithm
- Rules based on criteria or characteristics of the customer
5. Bank Marketing Dataset
The data relates to the direct marketing campaigns of a Portuguese banking institution.
# of Instances - 4521
# of Attributes - 16 + Output attribute
Campaign window: May – Nov (attractive term deposits with good interest rates)
6. Bank Marketing Dataset
Class distribution (y) - No (88.48%), Yes (11.52%)
Missing attribute values - None
8. Attribute Selection – Most IG
Expected information needed to classify a tuple in the training set (ID3 measure) - 0.515522 bits

Rank   Attribute   Information Gain
1      duration    0.072523
2      poutcome    0.037581
3      job         0.009991
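The expected-information figure is the Shannon entropy of the class label, and information gain is the drop in that entropy after splitting on an attribute. A minimal sketch in Python follows. It assumes the overall class split (88.48% / 11.52%) from the dataset slide rather than the exact training-set counts, which the slides do not give, so it only approximates the 0.515522 bits reported. The function names `entropy` and `info_gain` are illustrative.

```python
from math import log2

def entropy(counts):
    """Shannon entropy in bits of a class distribution given as counts or proportions."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    """ID3 information gain: parent entropy minus the weighted entropy of the partitions.

    parent_counts: class counts before the split, e.g. [n_no, n_yes]
    partitions:    one list of class counts per attribute value
    """
    total = sum(parent_counts)
    weighted = sum(sum(part) / total * entropy(part) for part in partitions)
    return entropy(parent_counts) - weighted

# Assumption: overall class proportions from the dataset slide, not the training split.
class_entropy = entropy([0.8848, 0.1152])
print(f"Expected information: {class_entropy:.4f} bits")
```

Ranking attributes then amounts to calling `info_gain` once per attribute with that attribute's per-value class counts and sorting the results in descending order.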
10. Evaluation – Confusion Matrix (Test data)
                 Actual yes    Actual no     Total
Predicted yes    TP = 61       FP = 50       P' = 111
Predicted no     FN = 106      TN = 1140     N' = 1246
Total            P = 167       N = 1190      1357

Measure                                Formula                                 Value
Accuracy (recognition rate)            (TP + TN) / (P + N)                     0.885041
Error rate (misclassification rate)    (FP + FN) / (P + N)                     0.114959
Sensitivity (TPR, recall)              TP / P = TP / (TP + FN)                 0.365269
Specificity (TNR)                      TN / N = TN / (FP + TN)                 0.957983
Precision                              TP / (TP + FP)                          0.549550
F score                                2 * Prec * Recall / (Prec + Recall)     0.438849

Notes:
- This is a case of class-imbalanced data, with only 11.52% of tuples labeled “Yes”.
- Recall: what % of positive tuples are labeled as such.
- Precision: what % of tuples labeled as positive are actually positive.
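As a check on the arithmetic, the evaluation measures can be recomputed directly from the four confusion-matrix counts. The sketch below uses only the TP, FP, FN and TN values from the slide; the variable names are illustrative.

```python
# Counts taken from the confusion matrix on the test data.
TP, FP, FN, TN = 61, 50, 106, 1140

P = TP + FN  # actual positives
N = FP + TN  # actual negatives

accuracy = (TP + TN) / (P + N)
error_rate = (FP + FN) / (P + N)
sensitivity = TP / P          # recall, true positive rate
specificity = TN / N          # true negative rate
precision = TP / (TP + FP)
f_score = 2 * precision * sensitivity / (precision + sensitivity)

for name, value in [
    ("Accuracy", accuracy),
    ("Error rate", error_rate),
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Precision", precision),
    ("F score", f_score),
]:
    print(f"{name}: {value:.6f}")
```

The accuracy of about 88.5% is close to the 88.48% share of the "No" class, which is why recall and precision on the "Yes" class are the more informative figures here.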
12. Evaluation – ROC
Area under the ROC Curve - 0.7992
The larger the area, the better the model
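One way to read the area under the ROC curve: it is the probability that a randomly chosen positive tuple receives a higher score than a randomly chosen negative one. The sketch below computes it that way by pairwise comparison. The labels and scores are hypothetical values for illustration, not the model outputs behind the 0.7992 figure, and `roc_auc` is an illustrative name.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = yes, 0 = no) and predicted "yes" scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.4, 0.5, 0.2, 0.8]
auc = roc_auc(labels, scores)
print(f"AUC: {auc:.4f}")
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking. The pairwise loop is quadratic in the number of tuples, so it suits small illustrations rather than large test sets.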
13. Problems
Missing values
Pruning (noise/outliers)
Unbalanced dataset
- Bias in prediction
- Overfitting / underfitting
(Too many / too few variables in test set)
14. Conclusions
The bank should target customers who spent considerable time on the call
(duration between 212 and 638 seconds) and who responded positively during the
previous campaign (2%); this segment has a 75% hit rate.
The bank can also target customers whose call duration exceeds 802 seconds (4%),
with a 60% hit rate, as such customers are likely to be genuinely interested in
the deposit product.
Another set of potential customers has call durations between 638 and 802
seconds (1%) and falls into job categories such as housemaid, services and
technician; these customers tend to be risk-averse and look for a safe deposit
of their savings with fixed returns (62% hit rate).
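The three segments above can be written as explicit rules. The sketch below is one possible encoding, not the tree produced in the study. It rests on several assumptions: the argument names follow the dataset attributes `duration`, `poutcome` and `job`; a positive previous response is encoded as `"success"`; the interval boundaries are treated as shown in the comments; and only the three job categories named in the text are included, although the original list is open-ended.

```python
# Job categories named in the conclusions; the source list is open-ended ("etc").
RISK_AVERSE_JOBS = {"housemaid", "services", "technician"}

def is_target(duration_s, poutcome, job):
    """Return True if a customer falls into one of the three segments described."""
    # Segment 1: call of 212-638 s and a positive previous-campaign outcome.
    # Assumption: both bounds inclusive, positive outcome encoded as "success".
    if 212 <= duration_s <= 638 and poutcome == "success":
        return True
    # Segment 2: call longer than 802 s.
    if duration_s > 802:
        return True
    # Segment 3: call of 638-802 s and a job in the named categories.
    # Assumption: lower bound exclusive so the segments do not overlap.
    if 638 < duration_s <= 802 and job in RISK_AVERSE_JOBS:
        return True
    return False

print(is_target(300, "success", "admin."))
print(is_target(700, "unknown", "housemaid"))
```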
Further analysis could identify ways to improve the profitability of the
client’s business.
15. Future Direction
The overall accuracy of the classifier needs to be increased
• Use of Ensemble Methods for improving accuracy
- Bagging
- Boosting
- Random Forests
Strategy for the class imbalance problem (e.g. 1000 No vs. 100 Yes)
- Oversampling
- Undersampling, etc.
Experimenting with other classification methods such as Naïve Bayesian,
rule-based classification, etc.
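As an illustration of the oversampling idea, the sketch below duplicates randomly chosen minority-class rows until every class is as large as the largest one. It uses the 1000 No / 100 Yes example from this slide with synthetic placeholder rows. The function name `random_oversample` and the row layout are hypothetical, and this is simple random oversampling rather than any specific method used in the study.

```python
import random

def random_oversample(rows, label_of, seed=0):
    """Duplicate randomly chosen minority rows until all classes match the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap; k is 0 for the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Synthetic placeholder rows: (label, row id), 1000 "no" and 100 "yes".
rows = [("no", i) for i in range(1000)] + [("yes", i) for i in range(100)]
balanced = random_oversample(rows, label_of=lambda r: r[0])
print(len(balanced))
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the original class distribution.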
16. References
Paper on “Bank Direct Marketing Using Rule Based Classification”
Paper on “A Comparison of Different Classification Techniques for Bank Direct Marketing”
Classification PPT - Dalhousie University
Dataset - UCI repository (http://archive.ics.uci.edu/ml/datasets/Bank+Marketing)