Insurance Fraud Claims Detection

  1. Insurance Fraud Claims Detection Arul Kumar ARK 225229103 I MSc Data Science Bishop Heber College (Autonomous), Trichy
  2. INTRODUCTION Insurance fraud is the illegal act of filing a false insurance claim or exaggerating a legitimate claim for financial gain. Fraudulent claims not only cause financial losses for insurance companies but also drive up premiums for honest policyholders. Insurance companies therefore invest significant resources in detecting and preventing fraudulent claims.
  3. There are various techniques that insurance companies can use to detect fraud. Some of the commonly used methods include: ● Data analytics ● Machine learning ● Social media monitoring ● Investigative techniques ● Fraud detection software
  4. Machine learning is increasingly being used for insurance fraud claims detection. Machine learning algorithms can analyze large amounts of data to detect patterns that indicate fraud. There are several techniques that can be used in machine learning for insurance fraud claims detection, including: ● Supervised learning ● Unsupervised learning ● Deep learning ● Ensemble learning
  5. MOTIVATION: Fraud claims detection protects insurance companies from the financial losses that fraudulent activity causes. This project uses machine learning algorithms to detect fraudulent claims.
  6. Dataset description The Insurance Fraud Claims Detection dataset is a collection of insurance claims made by policyholders. The dataset is designed to help insurance companies detect fraudulent claims and improve their claims processing accuracy. The dataset contains a total of 1000 instances and 40 features, including both numerical and categorical variables. Each instance in the dataset represents a single insurance claim, and the features describe various aspects of the claim, such as the policyholder's age, gender, location, type of insurance, claim amount, and other related information. The target variable in the dataset is a binary label indicating whether the claim is fraudulent or not. About 14.4% of the claims in the dataset are labeled as fraudulent.
  7. Columns 'months_as_customer', 'age', 'policy_number', 'policy_bind_date', 'policy_state', 'policy_csl', 'policy_deductable', 'policy_annual_premium', 'umbrella_limit', 'insured_zip', 'insured_sex', 'insured_education_level', 'insured_occupation', 'insured_hobbies', 'insured_relationship', 'capital-gains', 'capital-loss', 'incident_date', 'incident_type', 'collision_type', 'incident_severity', 'authorities_contacted', 'incident_state', 'incident_city', 'incident_location', 'incident_hour_of_the_day', 'number_of_vehicles_involved', 'property_damage', 'bodily_injuries', 'witnesses', 'police_report_available', 'total_claim_amount', 'injury_claim', 'property_claim', 'vehicle_claim', 'auto_make', 'auto_model', 'auto_year', 'fraud_reported', '_c39'
  8. Numerical columns with respect to the fraud report
  9. Categorical columns with respect to the fraud report
  10. Plot Heatmap : A heatmap to check correlation (correlation describes how strongly two variables are related to each other)
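
The correlation check behind the heatmap can be sketched as follows. This is a minimal illustration, not the code used for the slides: the toy values are randomly generated, and the assumption that 'total_claim_amount' is the sum of the three claim components is made only to produce a visible correlation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that borrows a few column names from the dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "injury_claim": rng.integers(0, 20000, size=200),
    "property_claim": rng.integers(0, 20000, size=200),
    "vehicle_claim": rng.integers(0, 60000, size=200),
    "age": rng.integers(19, 65, size=200),
})
# Assumption for illustration only: the total is the sum of its components.
df["total_claim_amount"] = (
    df["injury_claim"] + df["property_claim"] + df["vehicle_claim"]
)

# Pairwise Pearson correlation between all numerical columns.
corr = df.corr()
print(corr.round(2))

# The matrix can then be drawn as a heatmap, for example with
# seaborn.heatmap(corr, annot=True) if seaborn is installed.
```

Each cell of the matrix is the correlation between one pair of columns, so the diagonal is always 1 and the matrix is symmetric.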
  11. Check Outlier : *Outliers can distort a correlation coefficient (often lowering it) and weaken the regression relationship*
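
One common way to flag outliers is the 1.5 × IQR rule, the same rule a box plot uses for its whiskers. The slides do not say which rule was applied, so the sketch below is an assumption; the helper name `iqr_outlier_mask` and the sample values are hypothetical.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical premium values with one extreme entry.
premiums = np.array([1100, 1150, 1200, 1250, 1300, 1350, 1400, 5000])
mask = iqr_outlier_mask(premiums)
print(premiums[mask])

# Effect of a single outlier on a correlation coefficient:
x = np.arange(10.0)
y = 2.0 * x                      # perfectly linear relationship
y_out = y.copy()
y_out[-1] = -40.0                # replace one point with an outlier
r_clean = np.corrcoef(x, y)[0, 1]
r_out = np.corrcoef(x, y_out)[0, 1]
print(r_clean, r_out)
```

In this toy case one corrupted point is enough to pull the coefficient well below the value for the clean data.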
  12. StandardScaler standardizes the features of a dataset. LabelEncoder encodes categorical variables as numerical variables: it converts each unique categorical value into a numerical value. Split ● X: the array of feature values ● y: the array of target values ● test_size: the proportion of the data to be used for testing (usually between 0.2 and 0.3) ● random_state: a random seed for reproducibility ● X_train: the array of feature values for the training set ● X_test: the array of feature values for the testing set ● y_train: the array of target values for the training set ● y_test: the array of target values for the testing set Fit And Transform
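
The preprocessing steps named on this slide can be sketched as below. The rows are invented for illustration, and `stratify=y` is an added assumption that keeps the class ratio similar in both splits; the slides do not state whether it was used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical rows that reuse three column names from the dataset.
df = pd.DataFrame({
    "age": [25, 40, 33, 51, 46, 29, 38, 60, 44, 31],
    "total_claim_amount": [5000, 72000, 6400, 58000, 61000,
                           4300, 70000, 52000, 6800, 66000],
    "incident_severity": ["Minor Damage", "Major Damage", "Minor Damage",
                          "Total Loss", "Major Damage", "Minor Damage",
                          "Total Loss", "Major Damage", "Minor Damage",
                          "Total Loss"],
    "fraud_reported": ["N", "Y", "N", "N", "Y", "N", "N", "Y", "N", "N"],
})

# LabelEncoder maps each unique category to an integer (alphabetical order).
df["incident_severity"] = LabelEncoder().fit_transform(df["incident_severity"])
y = LabelEncoder().fit_transform(df["fraud_reported"])   # N -> 0, Y -> 1
X = df.drop(columns="fraud_reported").to_numpy(dtype=float)

# Split before scaling so the test set does not influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training data only, then apply the same transform to the test data.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
print(X_train.mean(axis=0).round(6), X_train.std(axis=0).round(6))
```

Fitting the scaler on the training split alone is a deliberate choice: calling `fit_transform` on the full dataset would leak test-set statistics into training.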
  13. Algorithms ● LogisticRegression ● KNeighborsClassifier ● DecisionTreeClassifier
  14. LogisticRegression
  15. KNeighborsClassifier
  16. DecisionTreeClassifier
  17. Tree
  18. Comparison ● LogisticRegression: Accuracy Score : 0.72, Mean Squared Error : 0.28 ● KNeighborsClassifier: Accuracy Score : 0.685, Mean Squared Error : 0.315 ● DecisionTreeClassifier: Accuracy Score : 0.805, Mean Squared Error : 0.19
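
A comparison of this kind can be sketched as below. The data comes from `make_classification` as a stand-in for the insurance dataset, and the model settings are assumptions, so the printed scores will not match the figures above. For labels encoded as 0 and 1, every wrong prediction contributes a squared error of exactly 1, so the mean squared error equals the misclassification rate, that is, 1 minus the accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 1000 samples, binary target.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    results[name] = (acc, mse)
    print(f"{name}: accuracy={acc:.3f}  mse={mse:.3f}")

# Pick the model with the lowest MSE (equivalently, the highest accuracy).
best = min(results, key=lambda name: results[name][1])
print("Lowest MSE:", best)
```

Because the two metrics carry the same information for a binary 0/1 target, ranking by lowest MSE and ranking by highest accuracy select the same model.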
  19. Comparison : Visualization
  20. Confusion Matrix Comparison: Logistic Regression, K-Nearest Neighbors, Decision Tree
  21. Lowest MSE: the best model, selected for having the lowest MSE, is ['DecisionTreeClassifier']
  22. DecisionTreeClassifier : Best estimator *GridSearchCV* Best Parameters : {'criterion': 'entropy', 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 3}
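
A grid search over these four hyperparameters can be sketched as follows. Only the winning combination is reported on the slide, so the candidate values in `param_grid`, the 5-fold setting, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical search grid; it contains the reported best values among others.
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 5, 7],
    "min_samples_leaf": [1, 2, 3],
    "min_samples_split": [2, 3],
}

# Every combination is scored with 5-fold cross-validation on the training set.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_tree = search.best_estimator_   # refit on the full training set
test_score = best_tree.score(X_test, y_test)
print(search.best_params_)
print(test_score)
```

The held-out test set is used only once, after the search, so the reported test score is not biased by the hyperparameter selection.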
  23. DecisionTreeClassifier : Best estimator *GridSearchCV*
  24. Important features
  25. DecisionTreeClassifier : Important features
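
Selecting important features from a fitted tree can be sketched as below. The slides do not say how many features were kept or how they were chosen, so the `top_k = 3` cutoff, the generated feature names, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = tree.feature_importances_
order = np.argsort(importances)[::-1]          # most important first

top_k = 3                                      # hypothetical cutoff
top_features = [feature_names[i] for i in order[:top_k]]
print(top_features)

# Retrain on the selected columns only.
X_top = X[:, order[:top_k]]
tree_top = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_top, y)
```

In practice the importances should be computed on the training split only, and the reduced model should be evaluated on the same held-out test set as the full model.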
  26. Classification Report: DTC vs DTC (important features) vs DTC (best estimator)
  27. Confusion Matrix Comparison: DTC vs DTC (important features) vs DTC (best estimator)
  28. Function : plot_confusion_matrix A confusion matrix evaluates a classification model by comparing its predicted labels with the true labels. It reports the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The plot_confusion_matrix function takes a trained classifier and a set of test data and plots the matrix as a colored grid: rows are the true labels, columns are the predicted labels, the diagonal holds the correct predictions, the off-diagonal cells hold the errors, and the color of each cell encodes the number of instances in it. The plot shows at a glance how well the model predicts each class, and it makes it easy to compare different classifiers or different hyperparameter settings of the same classifier.
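
The layout described above can be checked on a small hand-made example. The labels below are invented for illustration. Note that `plot_confusion_matrix` was removed in scikit-learn 1.2; in current releases the same plot is produced by `ConfusionMatrixDisplay`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = not fraud, 1 = fraud.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")

# To draw the colored matrix with a recent scikit-learn and matplotlib:
#   from sklearn.metrics import ConfusionMatrixDisplay
#   ConfusionMatrixDisplay.from_predictions(y_true, y_pred)
```

Here six claims are truly non-fraudulent and four are fraudulent; the diagonal (4 and 3) counts the correct predictions and the off-diagonal cells (2 and 1) count the errors.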
  29. ROC: DTC vs DTC (important features) vs DTC (best estimator)
  30. Receiver Operating Characteristic (ROC) An ROC curve shows the trade-off between true positive rate (TPR) and false positive rate (FPR) as the classification threshold varies, so comparing curves shows which model better separates the positive and negative cases. A better model has a curve closer to the top-left corner of the plot (higher TPR at lower FPR); a worse model has a curve closer to the diagonal, which corresponds to random guessing. The area under the curve (AUC) summarizes performance across all possible thresholds in a single number: a perfect classifier has an AUC of 1 and a random classifier has an AUC of 0.5, so the model with the higher AUC is generally considered the better one.
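
The ROC points and the AUC can be computed directly from true labels and predicted scores. The four labels and scores below are a made-up toy case, not output from the models in these slides.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted fraud probabilities.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (FPR, TPR) point per distinct threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)
print(fpr, tpr)
print(auc)   # 0.75: three of the four positive/negative pairs are ranked correctly

# A scorer that ranks every positive above every negative reaches AUC = 1.
perfect_auc = roc_auc_score(y_true, np.array([0.1, 0.2, 0.8, 0.9]))
print(perfect_auc)
```

The AUC equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, which is why it does not depend on any single threshold. For a fitted classifier, the scores would typically come from `predict_proba(X_test)[:, 1]`.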
  31. CONCLUSION Insurance Fraud Claims Detection in Machine Learning is a crucial application of supervised learning algorithms in the insurance industry. It helps insurers to identify and prevent fraudulent activities by predicting whether a given insurance claim is fraudulent or not. By reducing their financial losses, insurers can offer competitive premiums to their customers and improve customer satisfaction. Moreover, detecting fraudulent activities can also help insurers to maintain their reputation in the market by preventing negative publicity due to fraudulent claims. Therefore, the use of Machine Learning in Insurance Fraud Claims Detection is beneficial for both insurers and policyholders alike.