Neural Network Classification and its Applications in Insurance Industry
COMP 7570 – Neural Networks Project Report
Inderjeet Singh
7667292
Department of Computer Science
University of Manitoba
December 8, 2011
Abstract
Neural networks used for classification, also known as neural classifiers, have many
advantages, but extracting rules from trained networks is a hard task that has attracted
research attention. [Lu] proposed a method for extracting rules from neural networks and
advocated the use of neural networks for classification and for data mining in
general. [Smith] presented a case study on the use of neural networks for customer retention in the
insurance industry and discussed the importance of predicting customer termination
patterns for remaining profitable in this highly competitive industry. [Viaene] deployed neural
networks to predict claim fraud in the automobile insurance industry. The relevance of the
inputs (fraud indicators) is important for detecting claim fraud, so they used neural networks
(MLP-ARD) to produce importance rankings of the fraud indicators.
1. Introduction
Neural networks [Scuse] are models of intelligence that consist of large numbers of simple
processing units, also known as neurons or nodes, that collectively are able to perform very
complex pattern matching tasks. These models perform stimulus-response (input-output)
mapping. Classification, a branch of data mining [Wiki], is the process of learning rules
or models from training data to generalize the known structure, and then classifying new data
with these rules.
In the data mining field, classification is normally performed with decision tree algorithms
and logistic regression. These days neural networks are also used as one of the approaches to
classification. Classification with neural networks is a popular area of research and has gained a
lot of attention, specifically in data mining, where the volume of data is too large to
handle.
Neural networks have many advantages when used for classification. They are data driven and
self-adaptive. They can approximate complex functions with high accuracy and can be used to
build non-linear models of real-world applications. They are
also tolerant of noisy data. Neural classifiers have problems as well. They usually lack
transparency and behave like a black box, and their training time is long because it
requires many repeated passes (epochs) over the training data. Also, extracting
classification rules [Lu] from neural networks is difficult because of their complex and
incomprehensible structure, with many links between input, hidden and output units.
Neural networks have already been used in real-world applications such as bankruptcy
prediction, credit scoring, quality control, insurance, handwriting recognition and
many more. In this report I focus specifically on their application in the insurance industry.
The insurance industry is very competitive. The success of an insurance company
depends upon profit and growth, and profit depends upon various factors. Predicting the
average claim cost and the frequency of claims, and examining the effect of changes in policy
prices (premiums) on customer retention [Smith], is critical for profit. Neural classification has
been applied in this regard to learn and predict whether a customer will terminate or renew a policy.
Claim fraud is another important issue in this industry. Companies are facing huge losses
from fraudulent claims made by claimants. They are looking for solutions for fraud
claim prediction and diagnosis. Neural classification [Viaene] helps to identify which fraud
indicators (inputs) are most crucial for predicting fraudulent claims. Both of the above
applications use different versions of multilayer feed-forward neural networks.
2. Extracting Symbolic Classification Rules from Neural Networks
In this work, [Lu] focus on mining classification rules from large databases with the help of
neural networks. The neural network approach has advantages such as a low classification error
rate and robustness to noise.
The neural network based classification approach they describe consists of three phases.
The first phase is network construction, in which a three-layer feed-forward neural network is
built. The construction method is inspired by the [Setiono 1995] method of
creating the network dynamically. Construction starts from a single hidden unit and
adds hidden units until the network classifies all the input
patterns correctly. Rather than minimizing the sum of squared errors, [Setiono 1995]
maximizes the likelihood function. Also, unlike the back propagation method, this method is
reported not to get stuck in local minima.
The second phase is network pruning. In network pruning, the penalty function of [Setiono] is
added to the error function, which helps to prune the network by weight removal. The penalty
function used in this approach is the sum of squared weights. The classification error rate
should not increase while the network is pruned. The first objective of pruning is to discourage
nonessential connections and the second is to prevent the connection weights from attaining large
values. Removing unnecessary weights from the network reduces the network's complexity.
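The pruning idea can be sketched in a few lines. This is only an illustration under stated assumptions, not the [Setiono] algorithm itself: the coefficient `penalty_coeff` and the magnitude cut-off `threshold` are hypothetical parameters, and the check that the classification error does not increase after pruning is left out.

```python
def penalized_error(error, weights, penalty_coeff=0.01):
    """Training error plus a sum-of-squared-weights penalty.

    penalty_coeff is a hypothetical value chosen for illustration.
    """
    return error + penalty_coeff * sum(w * w for w in weights)


def prune_small_weights(weights, threshold=0.1):
    """Set weights whose magnitude is below the threshold to zero.

    A zero weight stands for a removed link. The threshold is an assumption.
    """
    return [w if abs(w) >= threshold else 0.0 for w in weights]


weights = [0.9, -0.02, 0.05, -1.4]
pruned = prune_small_weights(weights)
remaining_links = sum(1 for w in pruned if w != 0.0)
```

In a fuller version each candidate removal would be accepted only if the classification error rate on the training data did not increase.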
The last phase is rule extraction from the pruned network. Extracting rules is not easy, as
the number of links in the pruned network is still too large to express the
relationship between inputs and outputs explicitly as if-then-else rules. It is also difficult
to derive a clear relationship between the continuous activation values of the hidden units and
the output units. Rule extraction from a pruned network consists of four steps:
1. Use a clustering algorithm to find clusters of the hidden units' activation values.
2. Enumerate the discretized hidden unit activation values and compute the network outputs, then
generate rules that describe the network output in terms of the hidden unit activation values.
3. For every hidden unit, enumerate the input values that lead to each activation value and
generate rules that describe the hidden unit activation values in terms of the input units.
4. Merge the rules obtained in the previous two steps to obtain rules that map inputs to
outputs.
They explained their rule extraction approach on one of the 10 classification problems
(functions) used in earlier research. They chose function 3 to demonstrate their approach.
Function 3 is shown in Figure 1.
Figure 1: Function 3 [Lu]
To solve the classification problem represented by function 3, they created a neural network
as described in the network construction phase above. They used a people database with
nine inputs (salary, commission, age, elevel, car, zip code, house-value, house-years and
loan) and one output representing the class. An input tuple can belong to group A or group B. The
inputs were represented as binary strings of 0s and 1s, where the bits for an input are set to
0 or 1 depending upon which subinterval the value of the input falls in. With this binary
coding, the nine inputs are represented by a total of 37 binary input units (shown in Fig 11),
and one additional input unit for the bias makes a total of 38 units. The non-pruned network
consisted of six hidden units and one output unit and therefore had 234 links. The training
dataset they used has 2000 tuples of these inputs. Network pruning is performed as described above,
giving a much simpler network as shown in Fig 2. This pruned network consists of only two
hidden units and six input units.
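The link count of the non-pruned network can be checked with simple arithmetic. The sketch below assumes a fully connected three-layer network in which every input unit (including the bias unit) feeds every hidden unit and every hidden unit feeds the output; the function name `count_links` is introduced here for illustration only.

```python
def count_links(n_inputs, n_hidden, n_outputs):
    """Number of links in a fully connected three-layer feed-forward network."""
    return n_inputs * n_hidden + n_hidden * n_outputs


binary_inputs = 37
input_units = binary_inputs + 1  # one extra input unit for the bias
links = count_links(input_units, n_hidden=6, n_outputs=1)
```

With 38 input units, six hidden units and one output unit this gives 38 x 6 + 6 x 1 = 234 links, which agrees with the figure quoted above.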
Figure 11: Coding of the attributes of the neural network inputs [Lu]
Before rules are extracted from the pruned network shown in Fig 2, the four steps described
above are executed. The activation values of its two hidden units are clustered. The clusters are
centered on 0.46 and 0.81, which results in two clusters of discretized activation values. For
the first hidden unit, input tuples are split into two groups, one with activation values in
[-1, 0.46) and the other with values in [0.46, 1). For the second hidden unit, input tuples are
split in the same way into groups of [-1, 0.81) and [0.81, 1). The discretized activation value
of a pattern on hidden unit j (j = 1 or 2) is represented by a value of 1 or 2. A value of 1 on a
hidden unit means that the input tuple belongs to group A, and a value of 2 means that it belongs
to group B. For an input to be classified in group A, the discretized value on either hidden unit
should equal 1; otherwise the input is classified in group B.
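The discretization and the resulting decision can be sketched as below. The boundaries 0.46 and 0.81 come from the text, but the source does not say which interval is coded as 1 and which as 2, so the coding used here (lower interval is 1, upper interval is 2) is an assumption, as are the function names.

```python
def discretize(activation, boundary):
    """Map a hidden-unit activation in [-1, 1) to a discrete value 1 or 2.

    Assumption: the lower interval [-1, boundary) is coded as 1 and the
    upper interval [boundary, 1) as 2. The source does not state this.
    """
    return 1 if activation < boundary else 2


def classify(act1, act2, boundaries=(0.46, 0.81)):
    """Group A if either hidden unit has discrete value 1, otherwise group B."""
    d1 = discretize(act1, boundaries[0])
    d2 = discretize(act2, boundaries[1])
    return "A" if d1 == 1 or d2 == 1 else "B"


example_group = classify(0.2, 0.9)
```

Under these assumptions a tuple lands in group B only when both hidden units fall in their upper intervals.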
To generate rules for each hidden unit that do not involve weights, [Lu] used the X2R algorithm
they had developed earlier. The rules obtained for the two hidden units are combined to give the
rules for the final output in terms of the input units. For function 3, they extracted a total of
5 rules with a total of 10 conditions from the pruned network. These rules are shown below. The
rules can then be expressed in terms of the actual input attributes age and elevel for function
3.
Else Default rule. Group B
For evaluation and analysis, they compared their approach of extracting rules from neural
networks with the decision tree classifier C4.5. The neural network
classification was tested on eight functions similar to function 3 described above. Random
number generation was used to create the datasets for testing the rules generated for the
different functions (classification problems). They used three-fold cross validation to estimate
the classification accuracy of the generated rules. Fig 3 shows the results of the
evaluation of the quality of the rules generated by neural networks for the different functions.
They found that neural classifiers generate far fewer rules than the decision tree algorithm C4.5,
as shown in Fig 4. The accuracy and the number of conditions per rule for the different functions
were comparable for both approaches.
They concluded that further effort is needed to make the training of neural classifiers faster.
In this regard, they suggested incremental training and rule extraction from the database.
Figure 2: Pruned network for Function 3 [Lu]
Figure 3: Averages of accuracy rates, the number of rules and the average conditions per rule obtained
[Lu]
Figure 4: The number of rules extracted from neural networks (NN) and C4.5 algorithm (DT) [Lu]
3. Neural Network Applications in Insurance Industry
3.1. An Analysis and Prediction of Customer Retention Patterns and Pricing
The problem of concern in the insurance industry is to set prices that match claim costs while
retaining existing customers and acquiring new ones. There has been a lot of
research in this regard, but because of the competitiveness of the industry hardly any results or
methods for solving this problem get published.
In this case study, [Smith] work on the structured problem of customer retention modelling using
regression, decision trees and neural networks, which are supervised learning methods. The
methods are used to learn the relationships between variables (inputs) and decisions (outputs).
They also study the unstructured problem of analysing claim patterns using clustering, which is
an unsupervised learning method. In this report I discuss the first problem,
customer retention using neural classification, which is the main focus of this project.
The growth of an insurance company depends upon attracting new customers and retaining
existing ones. Whether a customer renews or terminates a policy depends upon premium price,
service, personal preference, insured amount, convenience and many other factors. The
analysis of customer retention in this case study has two goals: first, to understand the reasons
for policy termination and second, to develop a tool (based on a neural classifier) for predicting
likely policy terminations. This tool helps in analyzing the impact of changes in premium
costs on likely terminations. Identifying the customers who are likely to terminate their policies
can also aid direct marketing campaigns.
To analyze customer retention patterns, [Smith] obtained data on 20914 auto policy
holders whose policies were due to expire in April 1998. The dataset included
demographic information (age group, postcode, etc.), policy details (premium, sum insured,
etc.) and policy holder history (rating, years on rating, claim history, etc.), as shown in Fig 5
below. In this dataset, 7.1% of policy holders did not renew their policies, which therefore
terminated. Through meetings with the insurance company, [Smith] found that premium
price and sum insured were major factors in likely policy terminations.
They used the SAS Enterprise Miner software for evaluation. SAS Enterprise Miner is a widely
known GUI-based commercial package for applying data mining techniques. The setup for this
experiment involves three levels: data processing (variable
selection, data transformation and data partitioning), application of data
mining techniques (clustering, regression, decision trees and neural networks), and
analysis (assessment, bar charts). The process flow diagram is shown in Fig 6. In the data
transformation step they normalized and log-transformed the variables. After transformation,
they had a total of 29 independent inputs and one output (the dependent variable, a
yes-or-no termination decision), shown in Fig 5.
Regression, decision tree and neural network methods (all available in the SAS software) were
used to build three separate classification models. These classifiers predict the
likely terminations or renewals of policies. The neural network is a three-layer feed-forward
network with 29 input units, 25 hidden units and a single output unit. The units use the
hyperbolic tangent activation function. The default learning rule, which uses a multiple
Bernoulli error function, is used. The error is minimized by adjusting the weights with a
conjugate gradient technique.
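A forward pass through a network of this shape, together with a Bernoulli (cross-entropy) error, might look like the sketch below. It is not the SAS Enterprise Miner implementation: the logistic output unit, the omission of bias terms, the random initial weights and the function names are all assumptions made for illustration, and no conjugate gradient training is shown.

```python
import math
import random


def forward(x, w_hidden, w_out):
    """Forward pass with tanh hidden units and a logistic output.

    The logistic output is an assumption; bias terms are omitted for brevity.
    """
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    net = sum(w * h for w, h in zip(w_out, hidden))
    return 1.0 / (1.0 + math.exp(-net))


def bernoulli_error(targets, outputs):
    """Negative log-likelihood of binary targets (cross-entropy error)."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        t * math.log(y + eps) + (1 - t) * math.log(1 - y + eps)
        for t, y in zip(targets, outputs)
    )


random.seed(0)
n_inputs, n_hidden = 29, 25
w_hidden = [[random.uniform(-0.1, 0.1) for _ in range(n_inputs)]
            for _ in range(n_hidden)]
w_out = [random.uniform(-0.1, 0.1) for _ in range(n_hidden)]
x = [random.random() for _ in range(n_inputs)]  # one made-up policy record
p_terminate = forward(x, w_hidden, w_out)
```

Training would then adjust `w_hidden` and `w_out` to reduce `bernoulli_error` over the training set.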
All three methods are run on the test set to classify the likely terminating policies. The
test set consists of 20% of the entire dataset and is ranked in descending order of the predicted
likelihood of policy holders terminating their policy. Fig 7 shows the lift chart comparing the
performance of the three methods in classifying policy holders as terminating. A lift chart
measures the effectiveness of a predictive model, and the area under the lift curve indicates how
accurate the predictive model is. The x-axis depicts the percentage of policy holders selected
from the ranked test set and the y-axis depicts the percentage of likely terminating
customers among the policy holders selected. As can be seen in Fig 7, the white
line (the lift curve for neural networks) has the largest area under it, which means it classifies
the most terminating policies correctly. If only the top 10% of policy holders, ranked in
order of likely termination predicted by the neural network model, are selected, 50% of the
predicted terminations are correct. With regression and decision trees this accuracy is only 40%
and 28% respectively.
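One point on such a chart can be computed as sketched below. The reading taken here, that the y-value is the share of selected policy holders who actually terminate, is an assumption about how the chart is defined, and the scores and labels are made-up data.

```python
def response_rate_at(scores, labels, fraction):
    """Share of actual terminations among the top `fraction` of policy
    holders when ranked by predicted termination score (highest first)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_selected = max(1, int(round(fraction * len(ranked))))
    selected = ranked[:n_selected]
    return sum(label for _, label in selected) / n_selected


# Made-up scores and outcomes (1 = terminated, 0 = renewed).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 1, 0, 0]
top_20_percent = response_rate_at(scores, labels, 0.2)
```

Evaluating the function over a range of fractions and plotting the results gives one curve per classifier, which is how the three methods can be compared.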
The effect of the decision threshold on the number of policies classified as terminated by the
network is also examined. If the decision threshold is set to 0.5, a policy is classified as
terminated if the termination probability predicted by the neural network is above 0.5. It is
observed that setting a low threshold of 0.1 helps in predicting all likely terminations.
Marketing mail can be sent to these customers to encourage them to renew their policies.
However, a low decision threshold results in a loss of accuracy in predicting terminations. It is
better to keep the decision threshold high (high accuracy) if premiums are being changed for the
policy holders predicted as most likely to terminate. This ensures that premium
changes are made only for customers who are likely to terminate.
Misclassification costs can be specified to generate a profit-loss matrix. For example, if a
policy holder is classified as a likely termination but renews the policy, the misclassification
cost is the discount offered as an incentive to renew. On the other hand, if a
customer is not predicted as a termination but actually terminates the policy, the
misclassification cost is the loss of that customer's premium for the next year. The optimal value
of the decision threshold needs to be determined to minimize misclassification costs and maximize
profits.
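Choosing the threshold by total misclassification cost can be sketched as follows. The cost values, the candidate thresholds and the data are hypothetical; the two cost types follow the discount and lost-premium example in the text.

```python
def total_cost(probs, actual, threshold, discount_cost, lost_premium):
    """Total misclassification cost at a given decision threshold.

    A false alarm costs the discount offered; a missed termination costs
    the next year's premium. Both amounts are hypothetical inputs.
    """
    cost = 0.0
    for p, terminated in zip(probs, actual):
        predicted = p > threshold
        if predicted and not terminated:
            cost += discount_cost
        elif not predicted and terminated:
            cost += lost_premium
    return cost


def best_threshold(probs, actual, candidates, discount_cost, lost_premium):
    """Candidate threshold with the lowest total misclassification cost."""
    return min(
        candidates,
        key=lambda th: total_cost(probs, actual, th, discount_cost, lost_premium),
    )


# Made-up predicted termination probabilities and actual outcomes.
probs = [0.05, 0.2, 0.4, 0.6, 0.9]
actual = [False, True, False, True, True]
chosen = best_threshold(probs, actual, [0.1, 0.3, 0.5], 50.0, 500.0)
```

When a missed termination is much more expensive than an unnecessary discount, as in this made-up example, the low threshold wins.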
Pricing the policies is the tricky part. Pricing occurs in four steps: prediction of
claim costs, identification of the premium price that achieves profitability, analysis of
customer retention patterns given the difference between old and new premiums, and
finally adjustment of these premiums to retain customers while still making a profit.
These four steps are executed each time before marketing mail is sent to
customers who are likely to terminate. The new price of a policy might not suit the customer,
who may then decide to terminate. The data with the new policy price, together with the
price difference, is fed into the neural network model to predict likely terminations under the
new prices. The prices can then be adjusted to balance the goals of profitability and
customer retention. Optimal pricing is thus an iterative process aimed at finding this balance.
Figure 5: Total of 29 input attributes [Smith]
Figure 6: Process flow diagram for customer retention classification [Smith]
Figure 7: Lift Chart showing percentage of policy holders classified for likely termination vs.
percentage of policy holders selected from the test dataset. It shows the performance comparison for
classification techniques such as regression, decision tree and neural networks [Smith]
3.2. Auto Claim Fraud Detection using Bayesian Learning Neural Networks
Companies lose large amounts of money to fraudulent claims. Insurance
companies are looking for solutions for fraud claim prediction and diagnosis, and these days they
are using tools that rely on neural networks and artificial intelligence to solve this problem.
Neural networks provide general and scalable parameterized non-linear mappings between
inputs and outputs. However, they also pose practical problems, such as how to set the weights
before training starts and how to avoid fitting the noise in the training data, which make them
difficult to implement. These issues are mostly handled in ad hoc ways.
In this paper, [Viaene] use Bayesian learning to deal with these issues while training
neural networks. Bayesian learning builds the model in a step-by-step manner rather than in an
ad hoc way. [Viaene] explore the predictive power of Multi-Layer Perceptron (MLP) neural
network classifiers trained with the [MacKay] evidence framework approach to Bayesian
learning, which is used to optimize an automatic relevance determination (ARD) objective
function. The ARD objective function is useful for determining the relative importance of the
inputs to the model. ARD and the evidence framework are described in more detail below.
They use an MLP back propagation neural network as shown in Fig 8. The hidden nodes
of the network have a hyperbolic tangent transfer function and the output layer has a logistic
sigmoid activation function. In Fig 8, x represents the input vector, z represents the output of
the hidden units and y represents the final output. The continuous output y(x) of this MLP
classifier can be interpreted as the posterior probability $p(t = 1 \mid x)$, that is, the
probability of class t = 1 given the input vector x. The Bayesian posterior probability estimates
produced by the MLP help classify the input vector into predefined classes by choosing a threshold
in the scoring interval. While training the network, the weight vector w is adjusted so that the
objective function, which is the sum of squared errors, is minimized.
They measured prediction accuracy using two metrics: percentage
correctly classified (PCC) and area under the receiver operating characteristic curve (AUROC).
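Both metrics can be computed directly from class labels and classifier scores. The sketch below is a plain illustration, not the evaluation code of [Viaene]: the 0.5 threshold for PCC is an assumption, AUROC is computed by its pairwise-ranking definition, and the data are made up.

```python
def pcc(labels, scores, threshold=0.5):
    """Percentage correctly classified at a given score threshold."""
    correct = sum(
        1 for t, s in zip(labels, scores) if (s >= threshold) == bool(t)
    )
    return 100.0 * correct / len(labels)


def auroc(labels, scores):
    """Probability that a random positive case scores higher than a random
    negative case; ties count as one half."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0]          # 1 = fraud suspected, 0 = not suspected
scores = [0.9, 0.4, 0.35, 0.2]  # made-up classifier outputs
accuracy = pcc(labels, scores)
area = auroc(labels, scores)
```

PCC depends on the chosen threshold, whereas AUROC summarizes the ranking quality over all thresholds.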
Figure 8: Example of three layers Neural Network [Viaene]
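When optimizing a neural classifier for good generalization, it should be prevented from
learning the noise in the training data, a problem known as overfitting. To avoid overfitting, a
validation dataset is usually used. A better approach is to add a regularization (penalty) term
to the objective function. The unit-based regularization term is also known as ARD. The final
objective function now takes the form $E = E_D + \sum_m \alpha_m E_{W_m}$, where $E_D$ is the data
error, $E_{W_m} = \frac{1}{2} \sum_{i \in m} w_i^2$ is the penalty on the weights in group m and
$\alpha_m$ is the regularization parameter for that group.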
They discuss how critical input selection is to the overall classification process. The
regularization parameter $\alpha_m$ of an input m in the ARD objective function suppresses the
weights leaving that input. The larger $\alpha_m$ is, the more irrelevant the input, and vice
versa. The regularization parameters allow MLP-ARD to include a large number of potentially
relevant input variables, eliminating the effort needed to delete irrelevant input variables
beforehand. In effect, the degree of importance of each input variable is adjusted during
classification; this is known as soft input selection.
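The role of the per-input regularization parameters can be illustrated with a small sketch. It assumes an objective made of a data error plus, for each input, that input's parameter times half the sum of its squared outgoing weights, and it ranks inputs by ascending parameter value; the input names, the numbers and this ranking rule are illustrative assumptions, not the exact procedure of [Viaene].

```python
def ard_objective(data_error, input_weights, alphas):
    """Data error plus one weight penalty per input.

    input_weights[m] holds the weights leaving input m, and alphas[m] is
    the regularization parameter of that input (assumed form).
    """
    penalty = sum(
        alpha * 0.5 * sum(w * w for w in weights)
        for alpha, weights in zip(alphas, input_weights)
    )
    return data_error + penalty


def rank_inputs(names, alphas):
    """Most relevant input first: a smaller alpha means a more relevant input."""
    return [name for _, name in sorted(zip(alphas, names))]


names = ["input_a", "input_b", "input_c"]  # hypothetical input names
alphas = [5.0, 0.2, 1.0]                   # made-up regularization parameters
ranking = rank_inputs(names, alphas)
objective = ard_objective(1.0, [[1.0, 1.0], [2.0]], [2.0, 0.5])
```

A large parameter makes any non-zero weight on that input expensive, so training drives those weights towards zero without the input being removed outright.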
Bayesian learning is used to build probabilistic models of the dataset, which are
then used for prediction. Bayesian models are described in terms of a posterior probability
density over the weight space, and predictions are made by integrating over this posterior.
The evidence framework approach to Bayesian learning for MLP classifiers that they
discuss requires a local Gaussian approximation to the posterior probability density. They
introduce input relevance (ARD) into the evidence framework with the help of
these Gaussian assumptions. The main objective is to obtain appropriate values
for the weight vector w and the regularization parameters $\alpha_m$.
They used a Personal Injury Protection (PIP) automobile insurance claim fraud dataset
for their evaluation. The PIP claims dataset consists of 1399 closed automobile insurance claims
from accidents that occurred in Massachusetts, USA in 1993. This data was investigated
for suspicion of fraud by domain experts. The dataset included 25 binary fraud indicators (red
flags; see Fig 9) and 12 non-indicator inputs (non-flags) that investigators find valuable in
assessing fraudulent claims. In this dataset, ACC stands for accident, CLT for claimant, INJ for
injury and INS for insured driver, etc. The input selection was done after discussions with domain
experts.
These closed claims were reviewed by a claim manager for suspected fraud on the basis of these
indicators (inputs). Each claim was rated on a 10-point scale for suspected fraud. Claims
were also reviewed on the basis of a verbal assessment by the claim manager. A claim is
suspected of fraud if its suspicion score is >= 4, in which case further investigation is done;
otherwise no investigation is done.
Figure 9: PIP binary fraud indicators with values (0=No, 1=yes) [Viaene]
In the empirical evaluation they perform input selection using MLP-ARD on the PIP insurance
claims data. The input importance ranking obtained from MLP-ARD is then compared with the input
importance rankings from logistic regression and decision tree learning. Logistic
regression serves as the first reference for comparison, with the relative
importance of inputs based on the regression coefficients. Decision
trees serve as a second reference. Implementation-wise they used the m-estimation
smoothed and curtailed C4.5 variant, an improved version of the C4.5 algorithm.
The importance of an input in a decision tree is determined by its role in splitting the tree so
that the maximum entropy difference is achieved. The decision tree
implementation did not perform as well as logistic regression and
MLP-ARD.
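The entropy difference that drives a decision tree split can be sketched as plain information gain for a binary indicator. C4.5 and the variant used in the paper apply further refinements that are not reproduced here, and the labels and indicator values below are made up.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result


def information_gain(flags, labels):
    """Entropy reduction obtained by splitting on a binary indicator."""
    n = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [t for f, t in zip(flags, labels) if f == value]
        gain -= len(subset) / n * entropy(subset)
    return gain


labels = [1, 1, 0, 0]            # made-up fraud labels
informative_flag = [1, 1, 0, 0]  # separates the classes perfectly
useless_flag = [1, 0, 1, 0]      # carries no information about the class
gain_informative = information_gain(informative_flag, labels)
gain_useless = information_gain(useless_flag, labels)
```

An indicator with a larger gain is chosen earlier in the tree, which is the sense in which its role in splitting reflects its importance.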
10-fold cross validation can also be performed for input evaluation with the three
classification approaches. This leads to ensemble-based input assessment, in which the
input assessment is aggregated and averaged over the 10 models of the cross validation. Fig 10
shows the input rankings derived from the three methods. The Rank 1 input is the most important.
The number in brackets is the input importance relative to Rank 1. From the rankings it is
observed that six of the MLP-ARD top ten inputs also appear in the logistic regression top ten and
seven appear in the C4.5 top ten. The input rankings show that MLP-ARD and logistic regression
give comparable rankings. All three classifiers can be used together to form
an ensemble classifier.
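Averaging input importance over folds and comparing the top-ranked inputs of two methods can be sketched as below. The input names, importance values and helper names are hypothetical, and only two folds are shown to keep the example short.

```python
def average_importance(fold_importances):
    """Average each input's importance over the cross-validation folds."""
    names = fold_importances[0].keys()
    n_folds = len(fold_importances)
    return {
        name: sum(fold[name] for fold in fold_importances) / n_folds
        for name in names
    }


def top_k(importance, k):
    """Names of the k most important inputs, most important first."""
    return sorted(importance, key=importance.get, reverse=True)[:k]


def overlap(rank_a, rank_b):
    """Number of inputs that appear in both rankings."""
    return len(set(rank_a) & set(rank_b))


# Made-up importances for three hypothetical inputs over two folds.
folds = [
    {"a": 0.9, "b": 0.1, "c": 0.5},
    {"a": 0.7, "b": 0.3, "c": 0.5},
]
averaged = average_importance(folds)
top_two = top_k(averaged, 2)
shared = overlap(top_two, ["c", "b"])
```

With ten folds and a top-ten cut-off, `overlap` gives the kind of count quoted above for MLP-ARD against the other two methods.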
Figure 10: Input Importance Ranking [Viaene]
4. Discussion
I found that the [Lu] paper on rule extraction does not give the reasoning behind some of its
results. For example, the authors do not explain why, for function 4, the
accuracy of the neural network is lower than that of C4.5. They also do not explain why the number
of conditions per rule for function 5 is lower for the neural network than for C4.5, while for all
other functions the opposite holds. The paper was written in 1996, when research in this area was
at a nascent stage. They also did not explain exactly how
they arrived at the pruned network (refer Fig 2) with only four inputs. Their advocacy of the
use of neural networks in classification is justified for data classification scenarios
where training time is not a constraint.
While searching for papers on the use of neural networks in the insurance industry, I found
that little research is publicly available. There is probably credible work
on neural networks within insurance companies, but because of competition it is
not disclosed.
[Smith] used the SAS Enterprise Miner software for their analysis. While processing the
data they used the variable selection node of the SAS tool, but they did not explain
how this step could be carried out without the tool. In their results they gave
classification accuracy figures for decision thresholds of 0.1 and 0.5. The way they presented the
numbers for actually renewed, actually terminated, classified as renewed and
classified as terminated policies is not quite clear to me. Their evaluation is not very strong, as
they present only the lift chart for comparison with the other classification approaches.
The content of the [Viaene] paper does not quite match its title, “Auto claim fraud detection
using Bayesian learning neural networks”. The researchers say more about developing the MLP-ARD
approach and incorporating it into the evidence framework than about its use
in detecting claim fraud. The focus is on the theoretical side, with many equations. The
background information on the methods used is sparse, making the
paper difficult to understand. Many assumptions and approximations are used to
make their method work for soft input selection.
5. Conclusions and Future Work
Using the [Lu] method of rule extraction, high-quality rules can be obtained from datasets.
Their work acts as a bridge towards using neural networks for classification in
data mining. The time required for extracting rules is still large compared to the decision tree
approach. As a direction for future work they suggested incremental training and rule
extraction from the database. Another way of reducing the training time and increasing the
accuracy is to reduce the number of input units of the network.
[Smith] tried to find ways of pricing policies optimally while maintaining growth and
profitability. Their case study used neural networks to learn and predict customer retention
patterns. They discussed open issues such as identifying the misclassification costs for customer
retention analysis. A second issue is the implementation and incorporation of their method in the
insurance industry at a larger scale and in real time. They would like to work on these issues in
collaboration with the industry.
[Viaene] took a step towards understanding the underlying semantics of
neural network output predictions. This understanding is important for the use of neural networks
in everyday decision-making tasks such as predicting claim fraud. The impact of input selection on
the claim fraud detection process was their main concern. They demonstrated the soft input
selection capabilities of their proposed MLP-ARD method on a real-life insurance dataset.
I think that neural networks, because of their ability to build complex models, can be used more
effectively in insurance and other industries, and that there is still scope for a lot of work.
References
1. [Lu] H. Lu, R. Setiono and H. Liu, Effective Data Mining Using Neural
Networks, IEEE Transactions on Knowledge and Data Engineering, Vol. 8, 1996, pp. 957-961.
2. [MacKay] D. J. C. MacKay, The evidence framework applied to classification networks,
Neural Computation, Vol. 4, No. 5, 1992, pp. 720-736.
3. [Scuse] D. Scuse, Chapter 1 Intro, Class slides, University of Manitoba.
4. [Setiono 1995] R. Setiono, A neural network construction algorithm which maximizes the
likelihood function, Connection Science, Vol. 7, No. 2, 1995, pp. 147-166.
5. [Setiono] R. Setiono, A penalty-function approach for pruning feed forward neural
networks, Neural Computation, Vol. 9, No. 1, January 1997, pp. 185-204.
6. [Smith] K.A. Smith, R.J. Willis and M. Brooks, An Analysis of Customer Retention and
Insurance Claim Patterns Using Data Mining: A Case Study, The Journal of the
Operational Research Society, Vol. 51, May 2000, pp. 532-541.
7. [Viaene] S. Viaene, G. Dedene and R.A. Derrig, Auto claim fraud detection using
Bayesian learning neural networks, Journal of Expert Systems with Applications, Vol. 29,
2005, pp. 653-666.
8. [Wiki] Data Mining, http://en.wikipedia.org/wiki/Data_mining