Insurance Optimization
Project Report
Albert Chu
Introduction:
Insurance companies need a way to assess people in order to decide insurance plans and prices. My
internship with Northwestern Mutual involves finding clients and assessing which
insurance portfolio is most suitable for them. My method of analysis uses Newton's Method
to minimize risk and maximize expected return. Given that the insurance industry
is worth hundreds of billions of dollars, tiny inaccuracies are magnified.
Method:
How can we make portfolio optimization more accurate? We are given data sets with numerous
factors that influence eligibility for life insurance. Each of these variables is encoded as a dummy
variable ([0 = no] and [1 = yes]), but each is weighted differently by a function. By using more than
just a yes-no scale or a 1-10 scale, we can draw a more accurate result from the given data.
As said previously, each predictor variable in our function carries a different weight on the
result returned by R, and we eliminate insignificant variables from the model. With
that, we create a second-degree polynomial with all the interaction terms. The first step is to
minimize the variance (v = xᵀVx) and maximize the expected return (r = aᵀx). We then
use Newton's method to find the tiny inaccuracies introduced by the rounding that the function
applies to each person.
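As a minimal sketch of these two quantities (assuming a small illustrative covariance matrix V, expected-return vector a, and weight vector x; none of these values come from the report's data set):
import numpy as np
# Illustrative inputs only; V is a covariance matrix, a is the
# expected-return vector, and x holds the portfolio weights.
V = np.array([[0.10, 0.02],
              [0.02, 0.08]])
a = np.array([0.05, 0.07])
x = np.array([0.6, 0.4])
v = x.T @ V @ x   # variance v = x'Vx
r = a.T @ x       # expected return r = a'x
print('variance:', v, 'expected return:', r)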
The function given to us by R, σ²(x) = σ₂x² + σᵣx + V + error, can now be applied to the
dataset in the attached files. The overall variance and expected values are calculated and then placed
in the function. Each person has a different and unique profile indicating whether they will be
eligible for certain plans. To find the optimal point, we use Newton's Method
on our liability function and check the accuracy of the data against real-life models to test whether it
is viable. Newton's method runs further iterations to eliminate numerical rounding
errors and to represent the output more accurately, as in this pseudocode from the main set of
code:
newtons_method(test_preds, test['Response'].values, N, TOL=0.01):
    p0 = test_preds
    i = 1
    while i <= N:
        p = p0 - f(p0) / f'(p0)      # Newton step on the liability function f
        print(p)
        if abs(p - p0) < TOL:
            print('Took ' + str(i) + ' iterations')
            return p
        i = i + 1
        p0 = p
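Because the report does not write out the liability function, here is a minimal runnable sketch of the same iteration, assuming purely for illustration the function f(x) = x² - 2 with derivative f'(x) = 2x, the same 0.01 tolerance, and an iteration cap N:
def newtons_method(p0, f, fprime, N=1000, TOL=0.01):
    # Generic Newton iteration: p = p0 - f(p0)/f'(p0), stopping once
    # successive iterates differ by less than TOL.
    i = 1
    while i <= N:
        p = p0 - f(p0) / fprime(p0)
        print(p)
        if abs(p - p0) < TOL:
            print('Took ' + str(i) + ' iterations')
            return p
        i = i + 1
        p0 = p
    return p0

# Example: approximating the root of f(x) = x**2 - 2, i.e. sqrt(2).
root = newtons_method(1.0, lambda x: x**2 - 2, lambda x: 2 * x)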
Results:
The results would allow insurance companies to invest less money in risk, which in turn allows
lower rates that attract more clients. After a certain number of iterations, the method outputs a
very accurate number that determines a client's risk assessment and profitability. The original train
score is what the client originally received from the insurance company's test. Since we
are currently testing with a tolerance of only 0.01, Newton's method runs only a few iterations
on most results. You can now independently find a person's train score with far more accuracy
and see what kind of package they are eligible for. For example, for clients 1, 2, and 3 respectively:
Eliminate missing values
Train score is: 6.5
Optimization terminated successfully.
Current function value: 6.48286
Iterations: 4
Train score is: 7.3
Optimization terminated successfully.
Current function value: 7.29378
Iterations: 5
Train score is: 8.0
Optimization terminated successfully.
Current function value: 8.00905
Iterations: 8
The output for client 1 gives Absolute Error: 0.01714 and Relative Error: 0.0026438,
with the relative error below the 0.01 tolerance that was set. This amount is significant: with
millions of insured people in the US, each paying thousands of dollars a year, such errors add up
to huge monetary losses for insurance companies. To improve this further, I would need more time
to implement considerably more code to make it viable for the current economy.
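As a minimal sketch of how those two error figures can be computed (assuming optvalue is the converged Newton estimate and target the client's recorded response; both names are illustrative, not from the report's code):
# optvalue and target are assumed: the converged Newton estimate and
# the recorded response for one client.
abs_err = abs(optvalue - target)   # absolute error
rel_err = abs_err / abs(target)    # relative error
print('Absolute Error:', abs_err, 'Relative Error:', rel_err)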
Code:
import pandas as pd
import numpy as np
import xgboost as xgb
from scipy.optimize import fmin_powell
from ml_metrics import quadratic_weighted_kappa
def eval_wrapper(yhat, y):
    y = np.array(y)
    y = y.astype(int)
    yhat = np.array(yhat)
    yhat = np.clip(np.round(yhat), np.min(y), np.max(y)).astype(int)
    return quadratic_weighted_kappa(yhat, y)
def get_params():
    params = {}
    params["objective"] = "reg:linear"
    params["eta"] = 0.05
    params["min_child_weight"] = 240
    params["subsample"] = 0.9
    params["colsample_bytree"] = 0.67
    params["silent"] = 1
    params["max_depth"] = 6
    plst = list(params.items())
    return plst
def apply_offset(data, bin_offset, sv, scorer=eval_wrapper):
    # data has the format of pred=0, offset_pred=1, labels=2 in the first dim
    data[1, data[0].astype(int)==sv] = data[0, data[0].astype(int)==sv] + bin_offset
    score = scorer(data[1], data[2])
    return score
# global variables
columns_to_drop = ['Id', 'Response', 'Medical_History_10', 'Medical_History_24']
xgb_num_rounds = 700
num_classes = 8
eta_list = [0.05] * 200
eta_list = eta_list + [0.02] * 500
print("Load the data using pandas")
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")
# combine train and test
all_data = train.append(test)
all_data['BMI_Age'] = all_data['BMI'] * all_data['Ins_Age']
med_keyword_columns = all_data.columns[all_data.columns.str.startswith('Medical_Keyword_')]
all_data['Med_Keywords_Count'] = all_data[med_keyword_columns].sum(axis=1)
print('Eliminate missing values')
# Use -1 for any others
all_data.fillna(-1, inplace=True)
# fix the dtype on the label column
all_data['Response'] = all_data['Response'].astype(int)
# split train and test
train = all_data[all_data['Response']>0].copy()
test = all_data[all_data['Response']<1].copy()
# convert data to xgb data structure
xgtrain = xgb.DMatrix(train.drop(columns_to_drop, axis=1), train['Response'].values)
xgtest = xgb.DMatrix(test.drop(columns_to_drop, axis=1), label=test['Response'].values)
# get the parameters for xgboost
plst = get_params()
print(plst)
# train model
model = xgb.train(plst, xgtrain, xgb_num_rounds, learning_rates=eta_list)
# get preds
train_preds = model.predict(xgtrain, ntree_limit=model.best_iteration)
print('Train score is:', eval_wrapper(train_preds, train['Response']))
test_preds = model.predict(xgtest, ntree_limit=model.best_iteration)
train_preds = np.clip(train_preds, -0.99, 8.99)
test_preds = np.clip(test_preds, -0.99, 8.99)
# train offsets
# determine iterations for more accurate read
offsets = np.array([0.1, -1, -2, -1, -0.8, 0.02, 0.8, 1])
data = np.vstack((train_preds, train_preds, train['Response'].values))
for j in range(num_classes):
    data[1, data[0].astype(int)==j] = data[0, data[0].astype(int)==j] + offsets[j]
for j in range(num_classes):
    train_offset = lambda x: -apply_offset(data, x, j)
    offsets[j] = fmin_powell(train_offset, offsets[j])
# refine the raw test predictions against the recorded responses
# (newtons_method is defined below)
newtons_method(test_preds, test['Response'].values, 1000, 0.01)
# apply offsets to test
data = np.vstack((test_preds, test_preds, test['Response'].values))
for j in range(num_classes):
    data[1, data[0].astype(int)==j] = data[0, data[0].astype(int)==j] + offsets[j]
final_test_preds = np.round(np.clip(data[1], 1, 8)).astype(int)
preds_out = pd.DataFrame({"Id": test['Id'].values, "Response": final_test_preds})
preds_out = preds_out.set_index('Id')
preds_out.to_csv('xgb_offset_submission.csv')
def newtons_method(p0, target, N, TOL):
    # Newton iteration p = p0 - f(p0)/f'(p0). The report's liability
    # function f is not given, so as an assumption we use the residual
    # f(p) = p - target, whose derivative is 1.
    p0 = np.asarray(p0, dtype=float)
    target = np.asarray(target, dtype=float)
    i = 1
    while i <= N:
        p = p0 - (p0 - target) / 1.0  # f(p0) / f'(p0)
        print(p)
        if np.max(np.abs(p - p0)) < TOL:
            print('Took ' + str(i) + ' iterations')
            return p
        i = i + 1
        p0 = p
    return p0