Embark on a captivating journey into customer churn prediction with this insightful data analysis project presented by Boston Institute of Analytics. Our students explore the intricacies of customer behavior, leveraging advanced data analysis techniques to forecast and mitigate churn risk. From examining historical customer data and purchase patterns to identifying predictive indicators and building robust churn prediction models, this project offers a comprehensive look at the factors that influence customer retention. Gain invaluable insights and actionable recommendations derived from rigorous data analysis, presented in an engaging and informative format. Explore the project now and unlock new perspectives on customer relationship management. To learn more about our data science and artificial intelligence programs, visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/.
The document discusses building a machine learning model to predict customer churn for a telecommunications company from a dataset of customer characteristics. It describes preprocessing the data, exploring the features, training several classification models (logistic regression, support vector machines, random forests, and decision trees), and evaluating their performance. Logistic regression achieved the best result, predicting whether customers will churn with 79% accuracy. Future work could include further feature reduction and testing additional models to improve accuracy.
This document discusses predicting customer churn for a telecommunications company. It begins with an introduction to the problem and dataset, which contains information on 7,043 customers. It then preprocesses the data, which has 19 variables on demographic, account, and service characteristics. Various machine learning algorithms are trained and evaluated on the data, with logistic regression achieving the best accuracy of 79%. The document concludes with opportunities for future improvement and acknowledgments.
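As a rough illustration of the modeling step these summaries describe, here is a minimal logistic-regression churn sketch on synthetic data; the feature names and the churn rule are invented for illustration, not the actual dataset's columns:

```python
# Hypothetical sketch: fitting a logistic-regression churn classifier.
# Feature names and the toy churn rule are assumptions, not the real Telco data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
tenure = rng.uniform(0, 72, n)       # months with the provider (invented)
monthly = rng.uniform(20, 120, n)    # monthly charges (invented)
# Toy rule: short-tenure, high-charge customers churn
churn = ((tenure < 12) & (monthly > 80)).astype(int)

X = np.column_stack([tenure, monthly])
X_tr, X_te, y_tr, y_te = train_test_split(X, churn, test_size=0.3, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)   # fraction of correct churn/no-churn calls
print(round(accuracy, 2))
```

Accuracy alone can flatter a churn model when churners are rare, which is why the summaries above also discuss class-level performance.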
Explore our students' cutting-edge project on predicting bank customer churn using advanced analytics techniques. This project employs machine learning algorithms to analyze customer data and forecast the likelihood of churn, offering valuable insights for financial institutions. Gain insights into customer retention strategies, predictive modeling, and the potential impact on banking operations. To learn more, check out https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/.
- This case study aims to identify patterns that indicate whether a client is likely to have difficulty paying their instalments, which may be used to inform actions such as denying the loan, reducing the loan amount, or lending to risky applicants at a higher interest rate.
- This will ensure that consumers capable of repaying the loan are not rejected.
- Identifying such applicants using EDA is the aim of this case study.
The document discusses data preprocessing techniques. It covers why preprocessing is important by addressing issues like incomplete, inaccurate, or inconsistent data. It then describes major tasks in preprocessing like data cleaning, integration, reduction, transformation. Data cleaning techniques discussed include handling missing values, removing noise, and resolving inconsistencies. The goal of preprocessing is to improve data quality and prepare it for data mining.
Reduction in customer complaints - Mortgage Industry (Pranov Mishra)
The project analyzes customer complaints/inquiries received by a US-based mortgage (loan) servicing company.
The goal of the project is to build a predictive model using the identified significant contributors and to come up with recommendations for changes that will lead to:
1. Reduced re-work
2. Reduced operational cost
3. Improved customer satisfaction
4. Improved company preparedness to respond to customers.
Three models were built: Logistic Regression, Random Forest, and Gradient Boosting. Accuracy, AUC (area under the curve), sensitivity, and specificity all improved markedly as model complexity increased from simple to complex.
Logistic regression did not generalize well to the non-linear data, so the model suffered from high bias. Random Forest is itself an ensemble technique and helps greatly in reducing variance, while Gradient Boosting, with its sequential learning, helps reduce bias. The results from random forest and gradient boosting did not differ by much. This is consistent with the bias-variance trade-off: inflexible simple models have high bias on non-linear data, while flexible complex models fit it well at the cost of potentially higher variance.
Additionally, a lift chart was built, showing a cumulative lift of 133% in the first four deciles.
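The cumulative-lift calculation behind such a chart can be sketched as follows; the scores and labels here are synthetic stand-ins, not the project's mortgage data:

```python
# Cumulative lift by decile: rank customers by predicted probability, split into
# ten equal groups, and compare each cumulative group's response rate to the
# overall rate. Scores and labels are synthetic for illustration.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(1000)                          # model probabilities (made up)
labels = (rng.random(1000) < scores).astype(int)   # responders cluster at high scores

order = np.argsort(-scores)                        # best-scored customers first
sorted_labels = labels[order]
overall_rate = labels.mean()

deciles = np.array_split(sorted_labels, 10)
cum_responders = np.cumsum([d.sum() for d in deciles])
cum_counts = np.cumsum([len(d) for d in deciles])
cum_lift = (cum_responders / cum_counts) / overall_rate  # 1.0 means no lift

print([round(x, 2) for x in cum_lift[:4]])         # lift over the first four deciles
```

By construction the cumulative lift over all ten deciles is exactly 1.0, since the full population's response rate is the overall rate.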
The document discusses various data mining tasks relevant to customer relationship management (CRM). It describes classification, regression, link analysis, and deviation detection. Classification involves mapping data into predefined classes and is used for credit approvals, fraud detection, and targeting offers. Regression establishes relationships between variables to predict outcomes like sales or churn. Link analysis identifies connections between data items to reveal patterns in areas like referrals, purchases, and websites. Deviation detection finds significant changes from normal values to identify anomalies.
Predicting Bank Customer Churn Using Classification (Vishva Abeyrathne)
This document describes a study that used classification models to predict customer churn for a bank. The authors collected a dataset of 10,000 bank customers from Kaggle and preprocessed the data. They then explored relationships between features and the target variable of whether a customer churned. Two classification models were tested: KNN and decision tree. After hyperparameter tuning, the decision tree achieved the best accuracy of 84.25%, outperforming KNN. However, both models struggled to accurately predict customers who would churn. The authors concluded the decision tree was the best model but recommended collecting more data on churning customers.
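The tuning-and-comparison workflow described above can be sketched like this; the dataset is synthetic, not the Kaggle bank data, and the parameter grids are illustrative:

```python
# Illustrative KNN vs. decision-tree comparison with hyperparameter tuning,
# mirroring the study's setup on synthetic, imbalanced data (not the real dataset).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# 14 features and an 80/20 class imbalance, loosely echoing the described data
X, y = make_classification(n_samples=600, n_features=14, weights=[0.8, 0.2],
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

searches = {
    "knn": GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 9]}, cv=5),
    "tree": GridSearchCV(DecisionTreeClassifier(random_state=1),
                         {"max_depth": [3, 5, None]}, cv=5),
}
# Fit each grid search on the training split, then score on the held-out split
scores = {name: s.fit(X_tr, y_tr).score(X_te, y_te) for name, s in searches.items()}
print(scores)
```

With an imbalanced target like churn, per-class recall is worth checking alongside accuracy, which is exactly the weakness the study reports.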
This document describes a study that used classification models to predict customer churn for a bank. The authors collected a dataset of 10,000 bank customers with 14 features from Kaggle and preprocessed the data. They explored relationships between features and the target (churn) variable. Two classifiers were tested - KNN and decision tree. After hyperparameter tuning, the decision tree model achieved the best accuracy of 84.25%, outperforming KNN. However, both models predicted churn (class 1) less accurately than non-churn (class 0). The decision tree was selected as the best overall model despite its weakness in predicting churn.
Computing Ratings and Rankings by Mining Feedback Comments (IRJET Journal)
This document presents a framework for computing ratings and rankings of sellers on e-commerce platforms by mining feedback comments. It aims to address the issue of "all good reputation" where feedback is overwhelmingly positive. The proposed approach uses text mining techniques like opinion mining and sentiment analysis on feedback comments to extract aspect ratings for different dimensions of transactions. A calculation is proposed using dependency analysis and Latent Dirichlet Allocation to cluster aspect expressions into dimensions and compute dimension ratings and weights. Testing on eBay and Amazon data shows this approach can better distinguish sellers by reducing positive bias compared to existing reputation systems.
Dive deep into the world of insurance churn prediction with this captivating data analysis project presented by Boston Institute of Analytics. Our talented students embark on a journey to unravel the mysteries behind customer churn in the insurance industry, leveraging advanced data analysis techniques to forecast and anticipate customer behavior. From analyzing historical data and customer demographics to identifying predictive indicators and developing churn prediction models, this project offers a comprehensive exploration of the factors influencing insurance churn dynamics. Gain valuable insights and actionable recommendations derived from rigorous data analysis, presented in an engaging and informative format. Don't miss this opportunity to delve into the fascinating realm of data analysis and unlock new perspectives on insurance churn prediction. Explore the project now and embark on a journey of discovery with Boston Institute of Analytics. To learn more about our data science and artificial intelligence programs, visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/.
Many customers switch providers or unsubscribe (churn) from their telecom services for a variety of reasons, ranging from unsatisfactory service and better pricing from competitors to customers moving to different cities. Telecom companies are therefore interested in analyzing the patterns of customers who churn and using that analysis to determine which customers are most likely to unsubscribe in the future. One such company is Telco Systems, which wants to identify precise patterns among its churning customers and has provided the customer data for this project.
Delve into the realm of predictive modeling for loan approval. Learn how data science is revolutionizing the lending industry, making the loan approval process faster, more accurate, and fairer. Discover the key factors that influence loan decisions and how predictive modeling is shaping the future of lending. Visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more data science insights.
This document is a report analyzing revenue decline at a Portuguese banking institution. It uses data on 41,188 clients to predict term-deposit subscription with machine learning models. The methodology section describes data preparation, including handling missing values and outliers, and presents several data exploration visualizations. KNN and decision tree models are applied at different train-test splits, with the decision tree achieving slightly better accuracy scores. The duration attribute is found to influence subscription the most. The report concludes that decision trees perform better than KNN for this prediction problem.
Case Study: It’s All About Data – And the Customer (Jill Kirkpatrick)
Utilities are unlocking the power of data by coordinating information across organizational departments, applications, and databases to personalize their services and put the customer at the center of their businesses.
Explore in-depth insights into the intricate world of bank loan approval with this compelling data analysis project presented by Boston Institute of Analytics. Our talented students delve into the complexities of loan approval processes, leveraging advanced data analysis techniques to uncover patterns, trends, and factors influencing loan decisions. From evaluating credit scores and income levels to analyzing loan terms and default rates, this project offers a comprehensive examination of the key metrics and variables impacting bank loan approval. Gain valuable insights and actionable recommendations derived from rigorous data analysis, presented in an engaging and informative format. Don't miss this opportunity to delve into the fascinating realm of data analysis and unlock new perspectives on bank loan approval dynamics. Explore the project now and embark on a journey of discovery with Boston Institute of Analytics. To learn more about our data science and artificial intelligence programs, visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/.
Data mining and analysis of customer churn dataset (Rohan Choksi)
The document discusses a study conducted by a mobile phone company to analyze factors related to customer churn. The company provided a dataset of 3,332 customer records to build a neural network model that can predict which customers are likely to switch providers. Examining the data showed that increased usage of night, evening, and day minutes, as well as more customer service calls, correlated with higher churn. International calling plans also had a major impact on churn rates. The model achieved a misclassification rate of 7.11% and identified key variables for the company to address to reduce churn, such as international call pricing and infrastructure issues.
This document discusses using data mining techniques like clustering and classification to segment customers for shipping enterprises. It proposes using k-means clustering on historical freight data to divide customers into segments. A Bayesian network classifier is then used to classify new customers based on the clustering results. The goal is to support marketing decisions by identifying the most valuable customers to focus on through scientific customer segmentation.
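The clustering step of the proposed segmentation might look like this minimal sketch; the two freight features and their values are invented for illustration:

```python
# Sketch of the k-means segmentation step: cluster customers on (invented)
# freight features, producing labels a downstream classifier could learn from.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two made-up features per customer: annual shipment volume, average revenue
low_value = rng.normal([10, 1.0], 0.5, (50, 2))
high_value = rng.normal([100, 9.0], 0.5, (50, 2))
X = np.vstack([low_value, high_value])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
# The two well-separated synthetic groups should land in different clusters
print(labels[0] != labels[-1])
```

In the described pipeline, these cluster labels become the classes a Bayesian network classifier is trained to assign to new customers.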
This document discusses predicting loan defaults through machine learning models. It begins by introducing the business problem of banks suffering losses from customer loan defaults. It then describes preprocessing the loan dataset, which includes handling missing data, label encoding categorical variables, and balancing the dataset using SMOTE and SMOTEENN techniques. Logistic regression, decision trees, AdaBoost and random forest algorithms are applied to both the original and balanced datasets. The random forest model on the balanced data using SMOTEENN achieved the best accuracy of 92%. The model is then pickled and integrated into a web application using Flask for users to predict loan defaults.
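SMOTE's core idea, synthesizing minority-class samples by interpolating between nearest neighbours, can be sketched in a few lines; this is an assumed simplification for illustration, not the imbalanced-learn implementation the project would actually use:

```python
# Minimal SMOTE-style oversampling sketch (assumed simplification): each
# synthetic minority point is an interpolation between a minority sample and
# one of its k nearest minority-class neighbours.
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic samples from minority-class rows X_min."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

# 20 minority points in two tight clusters; synthesize 30 more
X_min = np.vstack([np.random.default_rng(1).normal(0, 0.1, (10, 2)),
                   np.random.default_rng(2).normal(3, 0.1, (10, 2))])
X_new = smote_oversample(X_min, n_new=30)
print(X_new.shape)
```

SMOTEENN, as used in the project, additionally applies edited-nearest-neighbour cleaning after oversampling to remove ambiguous points near the class boundary.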
Exploratory Data Analysis (EDA) is used to analyze datasets and summarize their main characteristics visually. EDA involves data sourcing, cleaning, univariate analysis with visualization to understand single variables, bivariate analysis with visualization to understand relationships between two variables, and deriving new metrics from existing data. EDA is an important first step for understanding data and gaining confidence before building machine learning models. It helps detect errors, anomalies, and map data structures to inform question asking and data manipulation for answering questions.
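A minimal EDA pass of the kind described, covering a univariate summary, a bivariate relationship, and a derived metric, might look like this; the dataset and column names are invented:

```python
# Toy EDA pass: univariate, bivariate, and derived-metric steps on an
# invented churn-style dataset.
import pandas as pd

df = pd.DataFrame({
    "tenure_months": [2, 40, 60, 5, 30, 12],
    "monthly_charge": [95.0, 60.0, 55.0, 99.0, 70.0, 85.0],
    "churned": [1, 0, 0, 1, 0, 1],
})

# Univariate: distribution of a single variable
print(df["tenure_months"].describe())

# Bivariate: how does the target relate to a feature?
print(df.groupby("churned")["monthly_charge"].mean())

# Derived metric: total spend to date from two existing columns
df["total_spend"] = df["tenure_months"] * df["monthly_charge"]
print(df["total_spend"].max())
```

Even this tiny example surfaces the kind of pattern EDA is after: in the toy data, churned customers carry higher average monthly charges than retained ones.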
Power theft is currently a common problem faced by all electricity companies. Since power theft directly affects their profits, detecting and preventing it is essential. This paper proposes a hybrid approach to detect electricity theft, i.e. to identify consumers suspected of committing theft, using SVM and ELM, and compares the approach with KNN.
Cross selling credit card to existing debit card customers (Saurabh Singh)
The document describes a process for identifying existing debit card customers who may be good candidates for credit cards using cluster analysis. Transaction and customer data will be analyzed to group customers into clusters. Debit card customers in clusters that also include credit card holders will be identified as potential new credit card customers. Two campaign programs are proposed: offering credit cards when a debit customer makes an unusually large transaction, and incentivizing the remaining identified potential customers.
Customer churn has evolved into one of the major problems for financial organizations. Incessant competition in the market and the high cost of acquiring new customers have driven organizations to focus on more effective customer retention strategies.
Online Service Rating Prediction by Removing Paid Users and Jaccard Coefficient (IRJET Journal)
This document summarizes a research paper that proposes a new method for online service rating prediction. The method first filters out paid users from rating datasets using visibility and interest metrics. It then learns the latent feature values of users and items based on interpersonal interest similarity, personal interest, rating similarity between friends, and Jaccard coefficient of common friends. The method is evaluated on precision, recall, detection rate and false alarm rate and shown to outperform an existing method called EURB on different sized datasets.
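The Jaccard coefficient of common friends mentioned above reduces to a small set computation, sketched here with invented usernames:

```python
# Jaccard coefficient of two users' friend sets:
# |friends(u) & friends(v)| / |friends(u) | friends(v)|
def jaccard(friends_u, friends_v):
    if not friends_u and not friends_v:
        return 0.0  # convention: no friends on either side means no similarity
    inter = len(friends_u & friends_v)
    union = len(friends_u | friends_v)
    return inter / union

u = {"alice", "bob", "carol"}
v = {"bob", "carol", "dave"}
print(jaccard(u, v))  # 2 common friends / 4 distinct friends = 0.5
```

In the paper's setting, this similarity feeds into the latent-feature model alongside the interest and rating-similarity signals.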
This document provides guidance for completing a detailed questionnaire for a customer service benchmark study. It outlines the scope and functional areas covered in the study, including contact center, billing, payment processing, field service, and more. It provides definitions for important terms and explains what costs should be included and excluded. The guidelines help ensure responses are accurate, comparable, and understood in the context of the underlying process models.
The document discusses dimensional modeling, which is a technique for structuring data to make it intuitive for business users and enable fast query performance. Dimensional modeling divides data into facts and dimensions. Facts are numeric measures, while dimensions provide context about the who, what, when, where, and how of the facts. The document describes different types of facts (additive, semi-additive, non-additive) and provides examples to illustrate dimensional modeling concepts like fact tables, dimension tables, and star and snowflake schemas.
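A toy star-schema query illustrates the fact/dimension split described above; the tables and columns are invented for illustration:

```python
# Star-schema sketch: a numeric fact table joined to a dimension table that
# supplies the "who" context. Tables are invented for illustration.
import pandas as pd

dim_customer = pd.DataFrame({"customer_id": [1, 2],
                             "segment": ["retail", "business"]})
fact_sales = pd.DataFrame({"customer_id": [1, 1, 2],
                           "amount": [10.0, 20.0, 5.0]})  # additive fact

# Typical dimensional query: sum an additive fact, grouped by a dimension attribute
report = (fact_sales.merge(dim_customer, on="customer_id")
          .groupby("segment")["amount"].sum())
print(report.to_dict())
```

The `amount` measure is additive (it can be summed across any dimension); a semi-additive fact such as an account balance could be summed across customers but not across time.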
The document provides guidance for completing a customer service benchmarking questionnaire. It defines key terms, outlines the scope and organization of the questionnaire, and provides instructions on reporting costs, staffing, and other metrics. Responses should be brief, 2-3 sentences, and focus on established practices rather than new initiatives. Certain costs like facilities and CIS systems are excluded from reporting.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/.
This presentation explores how K-means clustering can be used to analyze solar production data and identify patterns that can help optimize energy generation. Visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more.
More related content
Similar to Decoding Patterns: Customer Churn Prediction Data Analysis Project
The document discusses dimensional modeling, which is a technique for structuring data to make it intuitive for business users and enable fast query performance. Dimensional modeling divides data into facts and dimensions. Facts are numeric measures, while dimensions provide context about the who, what, when, where, and how of the facts. The document describes different types of facts (additive, semi-additive, non-additive) and provides examples to illustrate dimensional modeling concepts like fact tables, dimension tables, and star and snowflake schemas.
The document provides guidance for completing a customer service benchmarking questionnaire. It defines key terms, outlines the scope and organization of the questionnaire, and provides instructions on reporting costs, staffing, and other metrics. Responses should be brief, 2-3 sentences, and focus on established practices rather than new initiatives. Certain costs like facilities and CIS systems are excluded from reporting.
Semelhante a Decoding Patterns: Customer Churn Prediction Data Analysis Project (20)
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This presentation explores how K-means clustering can be used to analyze solar production data and identify patterns that can help optimize energy generation. visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more
This presentation dives into the world of data science and explores its application in predicting salary ranges. We'll uncover the secrets hidden within data sets, unveil the power of machine learning algorithms, and shed light on factors that influence salaries in today's job market.
Visit for more https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This presentation explores the potential of machine learning in predicting the severity of road accidents. We will delve into the data analysis process, the chosen machine learning algorithms, and the evaluation of our model's performance. This project aims to contribute to improved emergency response times and accident prevention strategies. visit for more: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Explore how our student team leveraged data science to forecast power consumption, empowering smarter energy management and sustainability initiatives. visit for more: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
In today's digital world, credit card fraud is a growing concern. This project explores machine learning techniques for credit card fraud detection. We delve into building models that can identify suspicious transactions in real-time, protecting both consumers and financial institutions. for more detection and machine learning algorithm explore data science and analysis course: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Delve into the realm of sensor networks and uncover the sophisticated techniques employed for anomaly detection and event prediction. From statistical analysis to machine learning algorithms, explore how these technologies empower proactive decision-making in various domains, including industrial monitoring, environmental sensing, and healthcare systems. To learn more about detection and other techniques visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Explore the cutting-edge methods and technologies utilized in rain forecasting, from traditional meteorological models to machine learning algorithms. Discover how these predictive tools enable accurate anticipation of rainfall patterns, aiding in disaster preparedness, agriculture planning, and urban infrastructure management. To learn in detail about analysis and prediction visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Ever wondered what factors influence house prices? This project explores the world of house price prediction using data science techniques. We delve into analyzing real estate data to build models that can estimate the value of a home. This can be a valuable tool for both buyers and sellers navigating the housing market. visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more details
This project explores sentiment analysis, a technique used to understand emotions expressed in text. We delve into the world of movie reviews, applying sentiment analysis techniques to uncover audience sentiment towards various films. This can provide valuable insights for filmmakers, studios, and moviegoers alike. For more analysis and artificial intelligence related content visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This slideshow dives into a data-driven analysis of NYC shootings. By employing cluster analysis, we uncover hidden patterns within these incidents, providing insights that can aid in crime prevention strategies. for more such analysis and management visit : https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Join us for a detailed examination of the cybersecurity posture of Travelblog.org, where we uncover potential vulnerabilities and suggest strategies for improvement. Learn how to protect websites from cyber threats and secure your digital presence by enrolling in our cybersecurity course at Boston Institute of Analytics. https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
Description: This presentation offers a deep dive into SQL Injection (SQLi) and Cross-Site Request Forgery (CSRF) vulnerabilities, demonstrating their impact through real-world examples. Join us to learn how to prevent and mitigate these threats, and take the first step towards a career in cybersecurity with our specialized courses at Boston Institute of Analytics. https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
This project demonstrates a machine learning approach to detecting credit card fraud using advanced algorithms and techniques. The project utilizes a dataset containing various features such as transaction amount, merchant location, time of transaction, and others to build a predictive model. The presentation covers data preprocessing steps, feature engineering techniques, and the selection of machine learning algorithms such as logistic regression or random forest. It also discusses model evaluation metrics and the importance of fraud detection in financial institutions for safeguarding against fraudulent activities. Visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This project showcases an AI-driven approach to detecting credit card fraud using machine learning algorithms. The project utilizes a dataset containing transactions with various features such as transaction amount, location, and time. The goal is to build a predictive model that can accurately identify fraudulent transactions and minimize financial losses for banks and customers. The presentation covers data preprocessing techniques, feature engineering, and the application of machine learning algorithms such as logistic regression or random forests. It also discusses model evaluation metrics and the importance of fraud detection in the banking industry. Visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This project presents a machine learning approach to predicting house prices using a dataset containing various features such as the size of the house, number of bedrooms, location, and others. The project aims to build a predictive model that can accurately estimate the selling price of a house based on its features. The presentation covers data preprocessing steps, feature selection techniques, and the application of machine learning algorithms such as linear regression or decision trees. It also discusses model evaluation metrics and the potential impact of the model on the real estate industry. Visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This project aims to predict whether a loan application will be approved or denied based on various factors such as applicant's income, credit score, loan amount, etc. Using a dataset containing historical loan application data, we employed machine learning algorithms to build a predictive model. The model was trained on features such as applicant's income, credit history, loan amount, loan term, and others. After training the model, we evaluated its performance using metrics like accuracy, precision, recall, and F1 score. The insights from this project can help financial institutions streamline their loan approval process and make informed decisions. Visit for more information: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
This presentation dives into the detailed analysis of vulnerabilities discovered in the web infrastructure of Aladel.net, highlighting potential security risks and offering insights into strengthening the website's defenses. Learn about the methods used to identify these vulnerabilities and the recommended strategies to mitigate them, ensuring a more secure online presence for Aladel.net for more information explore our ethical hacking course : https://bostoninstituteofanalytics.org/cyber-security-and-ethical-hacking/
This presentation explores the impact of HTML injection attacks on web applications, detailing how attackers exploit vulnerabilities to inject malicious code into web pages. Learn about the potential consequences of such attacks and discover effective mitigation strategies to protect your web applications from HTML injection vulnerabilities. for more information visit https://bostoninstituteofanalytics.org/category/cyber-security-ethical-hacking/
Delve into the world of e-commerce order prediction and discover how data science is revolutionizing inventory management and customer satisfaction. Learn how predictive analytics can forecast future orders, optimize inventory levels, and enhance the overall shopping experience. Join us as we unravel the complexities of e-commerce forecasting. visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more data science insights
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI - Model Garden powered experiences, we are going to learn more about the integration of these generative AI APIs. We are going to see in action what the Gemini family of generative models are for developers to build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, these are referred to as the PaLM and Gemini family of generative ai models, and they come in different versions. We are going to cover how to use via API to: - execute prompts in text and chat - cover multimodal use cases with image prompts. - finetune and distill to improve knowledge domains - run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps using the generative ai industry trends.
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
3. PROJECT CONTENT
I. Introduction and Problem Statement
II. Data Loading
III. Data Exploring
IV. Data Cleaning
IV.1. Binning
V. Data Visualization
V.1. Univariate Analysis
V.2. Bivariate Analysis
VI. Feature Engineering
VII. Data Preprocessing
VIII. Train – Test Split
IX. Feature Scaling
X. SMOTEENN
XI. Model Building and Evaluation
XII. Model Comparison
CUSTOMER CHURN
4. I. INTRODUCTION
Q. What is Customer Churn?
• Customer churn occurs when customers or subscribers discontinue doing business with a firm or service.
• Each row represents a customer; each column contains a customer attribute described in the column Metadata.
The dataset includes information about:
• Customers who left within the last month – this column is called Churn.
• Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies.
• Customer account information – how long they have been a customer, contract, payment method, paperless billing, monthly charges, and total charges.
• Demographic information about customers – customer ID, gender, and whether they have partners and dependents.
THIS IS A CLASSIC TELECOM CHURN USE CASE.
5. PROBLEM STATEMENT
The target variable of the Telco Churn dataset revolves around predicting customer churn. It has only two possible outcomes: churn or not churn (binary classification). "Churn" refers to customers who are likely to cancel their contracts soon. In the telecom industry, customer churn is a significant issue, as it leads to revenue loss. If the company can predict churn, it can reach out to users before they leave.
6. APPROACH TO SOLVE THE PROBLEM STATEMENT
1. Exploratory Data Analysis (EDA) to understand data patterns and relationships.
2. Data preprocessing, including handling missing values, encoding categorical variables, and feature scaling.
3. Splitting the dataset into training and testing sets.
4. Building and training machine learning models for churn prediction.
5. Evaluating model performance using metrics like accuracy, precision, recall, and F1-score.
6. Choosing the model with the best accuracy.
7. Providing recommendations based on model insights.
The ultimate goal is to help the telecom company proactively identify customers at risk of leaving, allowing it to implement targeted retention strategies and improve customer satisfaction.
7. II. DATA LOADING
• Importing the necessary libraries for data analysis and visualization, ensuring that visualizations are displayed inline.
• Reading the CSV file located at the specified path and assigning it to a pandas DataFrame called 'telco_churn' for further analysis.
• These steps are commonly performed at the beginning of a data analysis and machine learning project to set up the environment, load the dataset, and prepare for exploration and visualization. They are particularly useful for interactive data analysis.
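The loading step can be sketched as follows. Since the slides do not give the actual file path, a small in-memory CSV in the same shape stands in for the real dataset file here; in the project, `pd.read_csv` would be pointed at the Telco churn CSV instead.

```python
import io
import pandas as pd

# Stand-in for the real dataset file: a few rows in the same shape.
# In the project, pd.read_csv(<path to the Telco churn CSV>) is used instead.
csv_data = io.StringIO(
    "customerID,tenure,MonthlyCharges,TotalCharges,Churn\n"
    "0001,1,29.85,29.85,No\n"
    "0002,34,56.95,1889.5,Yes\n"
)

# Read the CSV into a DataFrame for further analysis
telco_churn = pd.read_csv(csv_data)
print(telco_churn.shape)  # (rows, columns)
```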
9. III. DATA EXPLORING
• The primary goal is to uncover patterns, relationships, anomalies, and insights that can inform subsequent analysis.
• Looking at the dataset using head(), tail(), sample(), and the size attribute.
10. • Checking various attributes of the dataset, such as shape (total number of rows and columns), column names, datatypes of columns, dimensionality, info() (memory usage, datatypes, NaN values), and describe() (min, max, median, 25%, 75%, and so on).
• The describe() method is useful for quickly understanding the distribution and central tendency of numerical data.
We can see that TotalCharges holds numerical values, but its datatype is shown as object.
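A minimal sketch of these exploration calls, on a small stand-in frame that reproduces the TotalCharges issue the slides describe; in the project they run on the full `telco_churn` DataFrame.

```python
import pandas as pd

# Stand-in frame with the same dtype issue as the slides describe:
# TotalCharges contains numeric strings plus a white-space entry.
telco_churn = pd.DataFrame({
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "TotalCharges": ["29.85", "1889.5", " "],
})

print(telco_churn.shape)       # (rows, columns)
print(telco_churn.dtypes)      # TotalCharges shows up as object
print(telco_churn.describe())  # distribution of the numerical columns
```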
11. • Checking value_counts(), nunique(), duplicated().sum(), isnull().sum()
• nunique() returns a Series with the count of unique values in each column.
OBSERVATION - value_counts() shows that no column has a naming issue, but 'No internet service' and 'No phone service' mean the same as 'No'.
OBSERVATION - isnull().sum() reports no missing values in the dataset at this point.
12. OBSERVATION
1. TotalCharges should be float or int, but it is object, so there might be hidden missing values in this column; we need to convert it to float or int.
• Because there are white spaces in the TotalCharges column, the missing values are not visible.
2. The SeniorCitizen column is actually categorical, hence its 25%-50%-75% distribution is not meaningful.
3. In the MonthlyCharges column, the average monthly charge is USD 64.76, while the 75th percentile is USD 89.85 per month.
4. There are no duplicated values.
13. IV. DATA CLEANING
1. Creating a copy of telco_churn for manipulation and processing, so there is no data leakage.
2. Churn column (target column): converting the churn column from a categorical value to a numerical value.
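These two cleaning steps can be sketched as below; the copy name `telco_data` and the Yes/No mapping are assumptions based on the slide's description.

```python
import pandas as pd

# Stand-in frame; in the project this is the loaded telco_churn DataFrame.
telco_churn = pd.DataFrame({"Churn": ["Yes", "No", "No"]})

# 1. Work on a copy so cleaning does not mutate the original data
telco_data = telco_churn.copy()

# 2. Map the categorical target to numeric 1/0
telco_data["Churn"] = telco_data["Churn"].map({"Yes": 1, "No": 0})
print(telco_data["Churn"].tolist())  # [1, 0, 0]
```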
14. • Displaying the maximum and minimum values.
• Finding the churn percentage in the Churn column.
OBSERVATION -
• The data is highly imbalanced, with a ratio of 73:27.
• So we analyze the data against the other features while taking the target values separately to get some insights.
15. 3. TotalCharges column
Total charges should be a numeric amount, so we convert the column to a numerical data type.
OBSERVATION -
• top: " " (the most frequent value in the TotalCharges column is white space)
• freq: 11 (the count of " " occurrences in the TotalCharges column)
16. Here we fill the white spaces with NaN values and calculate the percentage of NaN values with respect to the total number of rows.
As we can see, there are 11 missing values in the TotalCharges column. Let's check those records.
OBSERVATION - Since the share of these records in the total dataset is very low, about 0.16%, it is safe to fill them with 0 for further processing.
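The conversion and fill steps above can be sketched as follows; `pd.to_numeric` with `errors="coerce"` is one common way to turn the white-space entries into NaN, and is an assumption about the exact call used.

```python
import pandas as pd

# Stand-in column with the same issue the slides describe:
# numeric strings plus white-space entries hiding missing values.
telco_churn = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# Convert to numeric; non-numeric entries (the white spaces) become NaN
telco_churn["TotalCharges"] = pd.to_numeric(
    telco_churn["TotalCharges"], errors="coerce"
)
print(telco_churn["TotalCharges"].isnull().sum())  # 1 hidden missing value

# The share of missing rows is tiny, so fill them with 0 as the slides do
telco_churn["TotalCharges"] = telco_churn["TotalCharges"].fillna(0)
```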
17. Missing Value Treatment
Checking the data type of the 'TotalCharges' column.
OBSERVATION – After treating the missing values with 0, there are no missing values left.
18. IV.1. BINNING
4. Tenure column
Dividing customers into bins based on tenure, e.g. for tenure < 12 months, assign the tenure group 1-12; for tenure between 1 and 2 years, the tenure group 13-24; and so on (i.e. grouping tenure into bins of 12 months).
Dropping the tenure column, as we have already created tenure_group.
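The 12-month binning can be sketched with `pd.cut`; the exact bin edges and label format are assumptions matching the "1-12", "13-24", ... groups named on the slide.

```python
import pandas as pd

# Stand-in tenure values; in the project this is telco_churn["tenure"].
telco_churn = pd.DataFrame({"tenure": [1, 5, 13, 30, 61]})

# Labels "1-12", "13-24", ..., "61-72" for 12-month-wide bins
labels = ["{0}-{1}".format(i, i + 11) for i in range(1, 72, 12)]
telco_churn["tenure_group"] = pd.cut(
    telco_churn["tenure"], bins=range(1, 80, 12), right=False, labels=labels
)

# tenure_group replaces the raw tenure column
telco_churn = telco_churn.drop(columns=["tenure"])
print(telco_churn["tenure_group"].tolist())
```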
19. 5. Customer-ID column: removing the identifier column, as it is not useful for modeling.
6. Modifying columns: 'No internet service' and 'No phone service' are not different from 'No' and can be replaced with "No".
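The column modification above can be sketched with a DataFrame-wide `replace`; the two example service columns here are stand-ins for the full set of service columns in the dataset.

```python
import pandas as pd

# Stand-in frame with the two redundant category labels
telco_churn = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No internet service", "No"],
    "MultipleLines": ["No phone service", "Yes", "No"],
})

# Collapse 'No internet service' / 'No phone service' into plain 'No'
telco_churn = telco_churn.replace(
    {"No internet service": "No", "No phone service": "No"}
)
print(telco_churn["OnlineSecurity"].tolist())  # ['Yes', 'No', 'No']
```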
20. V. DATA VISUALIZATION
Data visualization is the representation of data in graphical or visual formats to communicate information effectively. It involves using charts, graphs, maps, and other visual elements to convey the patterns, trends, and insights present in the data. It is a powerful tool for exploring, interpreting, and presenting data in a way that is easily understandable.
Types of data visualization:
1. Univariate analysis: the examination of a single variable or feature in isolation.
2. Bivariate analysis: helps uncover patterns, correlations, and dependencies between two variables.
21. V.1. UNIVARIATE ANALYSIS
1. OBSERVATION - Customers with the Fiber optic internet service type have churned more; DSL is the most popular internet service type.
2. OBSERVATION - Most customers have not churned (No: 5,174); far fewer customers have churned (Yes: 1,869).
3. OBSERVATION - Electronic check, at 33.58%, is more common than any other payment method.
4. OBSERVATION - There are very few outliers in MonthlyCharges.
22. 5. OBSERVATION - The distribution appears right-skewed, with a longer tail on the right side, indicating that there are fewer senior citizens in the dataset.
6. OBSERVATION - Customers in the 1-12 tenure_group have churned the most.
7. OBSERVATION - Males make up 50.48% of customers and females 49.52%.
23. V.2. BIVARIATE ANALYSIS
1. OBSERVATION - The tenure_group within 12 months (i.e. 1 year) in the female category has churned the most.
2. OBSERVATION - The 'Month-to-month' contract has a significantly higher bar, suggesting a higher churn rate, mostly among female customers; with no contract terms, they are free to go.
24. 3. OBSERVATION - A surprising insight: higher churn at lower total charges.
4. OBSERVATION - Total charges increase as monthly charges increase, as expected.
5. OBSERVATION - Churn is high when monthly charges are high.
25. CONCLUSION FOR DATA VISUALIZATION
• Customers in the tenure_group within 12 months (i.e. 1 year) and non-senior citizens in the female category have churned the most.
• The 'Month-to-month' contract has a higher churn rate, mostly among female customers; with no contract terms, they are free to go.
• Churn is high when monthly charges are high and total charges are low, although total charges and monthly charges rise together.
• Far fewer customers have churned (Yes, count: 1,869), so the data is highly imbalanced, with a ratio of 73:27.
• Electronic check, at 33.58%, is the most common payment method among churning customers.
• The gender distribution is roughly balanced.
• Customers with the Fiber optic internet service type have churned more; DSL is the most popular internet service type.
• Among customers with phone service and paperless billing, which a significant number of customers have chosen, fewer have churned than not.
26. VI. FEATURE ENGINEERING
1. Creating binary features: converting categorical features like 'Partner' and 'Dependents' into binary features (0 or 1).
2. Creating a feature for family size: combining information from 'Partner' and 'Dependents' to create a feature representing the size of the customer's family.
27. 3. Creating a plot to see which family size has churned more.
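Steps 1 and 2 above can be sketched as below; the exact family-size formula (the customer plus the partner and dependents flags) is an assumption, since the slide only says the two columns are combined.

```python
import pandas as pd

# Stand-in frame with the two categorical columns named on the slide
telco_churn = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes"],
    "Dependents": ["No", "No", "Yes"],
})

# 1. Binary features: Yes/No -> 1/0
for col in ["Partner", "Dependents"]:
    telco_churn[col] = (telco_churn[col] == "Yes").astype(int)

# 2. Family size (assumed formula): the customer plus partner and dependents
telco_churn["family_size"] = 1 + telco_churn["Partner"] + telco_churn["Dependents"]
print(telco_churn["family_size"].tolist())  # [2, 1, 3]
```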
28. VII. DATA PREPROCESSING
The goal of data preprocessing is to enhance the quality of the data, remove any inconsistencies or errors, and prepare it for further analysis or modeling.
FEATURE ENCODING
Two techniques of feature encoding are:
1. One-hot encoding - a method used to convert categorical variables into a binary matrix (0s and 1s).
2. Label encoding - another technique for converting categorical data into a numerical format.
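The two encoding techniques can be sketched on a stand-in column; `pd.get_dummies` and pandas category codes are common implementations, and the choice of these particular calls is an assumption.

```python
import pandas as pd

# Stand-in categorical column
telco_churn = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Two year"]})

# 1. One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(telco_churn["Contract"], prefix="Contract")
print(one_hot.shape)  # (3, 3): three rows, one column per category

# 2. Label encoding: each category mapped to an integer code
telco_churn["Contract_label"] = telco_churn["Contract"].astype("category").cat.codes
print(telco_churn["Contract_label"].tolist())  # [0, 1, 2]
```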
31. 4. Correlation of the features with 'Churn‘
IDENTIFYING BEST FEATURE
This ‘Month-to-Month Contract‘ feature has the greatest influence among all features
32. 5. Visualizing the correlation of the features with 'Churn' using a HEATMAP.
OBSERVATION -
• HIGH churn is seen in the case of Month-to-month contracts.
• LOW churn is seen in the case of long-term contracts.
• Factors like gender, availability of PhoneService, and number of multiple lines
have almost NO impact on churn.
MULTIVARIATE ANALYSIS
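The correlation analysis above can be reproduced on synthetic data. In this sketch the frame is entirely hypothetical: the month-to-month flag drives churn by construction while gender is pure noise, mirroring the heatmap observations; a seaborn heatmap of `df.corr()` would show the same numbers graphically.

```python
import numpy as np
import pandas as pd

# Hypothetical encoded frame mimicking the heatmap's findings.
rng = np.random.default_rng(0)
n = 500
month_to_month = rng.integers(0, 2, n)
df = pd.DataFrame({
    "Contract_Month-to-month": month_to_month,
    "gender": rng.integers(0, 2, n),  # noise: no real relation to churn
    "Churn": (month_to_month & (rng.random(n) < 0.7)).astype(int),
})

# Correlation of every feature with 'Churn', strongest first.
corr = df.corr()["Churn"].drop("Churn").sort_values(ascending=False)
```

Sorting the correlations makes it easy to read off the best feature, here the month-to-month contract flag, with gender near zero.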
33. This code randomly splits the dataset X (features) and y
(labels) into two separate sets: the training set (X_train and y_train) and the
testing set (X_test and y_test). The split is done with a test size of 0.2,
meaning that 20% of the data will be allocated for testing, while the
remaining 80% will be used for training. The random_state parameter is set
to ensure reproducibility of the split.
1. Splitting the telco_copy into X and y and then doing Train-Test Split.
VIII. TRAIN – TEST SPLIT
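The split described above can be sketched as follows. The random arrays stand in for the preprocessed `telco_copy` features and labels; the 0.2 test size and fixed `random_state` match the slide.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the preprocessed telco_copy frame: 100 rows, 5 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# 80/20 split; random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```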
34. Scaling is performed to ensure that all numerical features in a
dataset are on a similar scale, avoiding biases, enabling fair comparisons,
and facilitating convergence during training. It is a technique used in
machine learning to standardize or normalize the range of independent
variables or features of the dataset.
Methods of feature scaling
1. Standardization (Z-score Normalization):This code is an
implementation of the standardization (Z-score normalization) method
for feature scaling. Standardization scales the features so that they
have a mean of 0 and a standard deviation of 1.
IX. FEATURE SCALING
35. 1. Standard scaling analysis
• Extracting the numerical features for scaling
• Scaling the numerical features
2. Fitting and transforming the training data, saving the scaling
parameters for reuse on the test data.
• Displaying the scaled training and test sets
36. 1. Before scaling the numerical features
2. After scaling the numerical features
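The standardization step can be sketched like this. The arrays are hypothetical stand-ins for the numerical features (e.g. tenure, MonthlyCharges, TotalCharges); the key point, as the slides note, is fitting the scaler on the training data only and reusing its parameters on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical features on very different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[30, 65, 2000], scale=[20, 30, 1500], size=(80, 3))
X_test = rng.normal(loc=[30, 65, 2000], scale=[20, 30, 1500], size=(20, 3))

scaler = StandardScaler()
# Fit on the training data only; the learned mean/std are reused on test data.
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

After scaling, each training feature has mean 0 and standard deviation 1, so no feature dominates purely because of its units.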
37. • SMOTEENN is used to address imbalanced datasets by generating
synthetic examples for the minority class (SMOTE) and cleaning the
dataset to remove noise (ENN), ultimately leading to a more
balanced and representative dataset for model training. For instance,
in a binary classification problem, one class may have significantly
fewer instances than the other.
X. SMOTEENN
38. XI. MODEL BUILDING & EVALUATION
Random Forest
XGBoost Classifier
K-Nearest Neighbors
Classifier (KNN)
Decision Tree
Support Vector Classifier
(SVC)
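The model comparison can be sketched with a simple fit-and-score loop. This version uses synthetic data and only the scikit-learn classifiers from the list above; XGBoost is named on the slide but lives in the separate `xgboost` package, so it is omitted here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled, resampled churn features.
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "SVC": SVC(random_state=42),
}

# Fit each model and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
```

On the imbalanced churn data, accuracy alone is not enough; as the next slide notes, per-class precision, recall and F1 (e.g. via `classification_report`) should be checked as well.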
39. • On imbalanced data, accuracy is a misleading metric.
• As you can see, the accuracy is quite low, and since this is an
imbalanced dataset we need to check recall, precision and F1 score for
the minority class. It is quite evident that precision, recall and F1
score are all too low for Class 1, i.e. churned customers. Hence, we
move ahead and apply SMOTEENN (oversampling + ENN).
• After using SMOTEENN
41. • After evaluating different models for churn detection, including Decision Tree, Random
Forest, K-Nearest Neighbors, Naïve Bayes, XGBoost and SVC, it can be concluded that
the XGBoost model achieved the highest accuracy among the evaluated models, with
an accuracy score of 0.9689. XGBoost is an ensemble learning method that combines
the predictions of multiple weak learners (typically decision trees) to create a strong
learner, which helps capture complex relationships in the data.
• The key importance lies in its ability to handle complex relationships in data, prevent
overfitting, handle missing values, and provide flexibility and customization for various
machine learning tasks.
• Combining XGBoost with SMOTEENN may enhance the model's performance on
imbalanced datasets. It helps the model better capture patterns in the minority class by
oversampling and cleaning the dataset.
CONCLUSION OF MODEL COMPARISON
42. The best model is the XGBoost Classifier, with the highest
accuracy score of 0.9689.
43. • Identifying the models with the maximum and minimum accuracy scores
44. 1. As MonthlyCharges increase, TotalCharges increase as well.
2. Customers on 'Month-to-month' contracts have a higher churn
rate; with no contract terms binding them, they are free to leave.
3. Churn is high when Monthly Charges are high and Total
Charges are low.
4. Electronic check is the payment method associated with the
most churn.
5. Customers with Fiber optic internet service have churned
more, while DSL is the most popular internet service type.
6. Phone service and paperless billing, each chosen by a
significant number of customers, show very little churn.
7. The XGBoost model achieved the highest accuracy among the
evaluated models.
OVERALL CONCLUSION