Churn Analysis. Presented by: PALLAVI MOHANTY
I. Introduction and Problem Statement
II. Data Loading
III. Data Exploring
IV. Data Cleaning
IV.1. Binning
V. Data Visualization
V.1. Univariate Analysis
V.2. Bivariate Analysis
VI. Feature Engineering
VII. Data Preprocessing
VIII. Train – Test Split
IX. Feature Scaling
X. SMOTEENN
XI. Model Building and Evaluation
XII. Model Comparison
Q. What is Customer Churn?
• Customer churn occurs when customers or subscribers stop doing
business with a firm or service.
• Each row represents a customer; each column contains a customer
attribute described in the column metadata.
The data set includes information about:
• Customers who left within the last month – the column is called
Churn.
• Services that each customer has signed up for – phone, multiple
lines, internet, online security, online backup, device protection,
tech support, and streaming TV and movies.
• Customer account information – how long they’ve been a
customer, contract, payment method, paperless billing, monthly
charges, and total charges.
• Demographic information about customers – Customer ID,
gender, and if they have partners and dependents.
The target variable of the Telco Churn dataset is customer churn. It
has only two possible outcomes, churn or not churn, which makes this
a binary classification problem. "Churn" refers to customers who are
likely to cancel their contracts soon. In the telecom industry,
customer churn can be a significant issue because it leads to revenue
loss. If the company can predict churn, it can engage those customers
before they leave.
1. Exploratory Data Analysis (EDA) to understand data patterns
and relationships.
2. Data preprocessing, including handling missing values,
encoding categorical variables, and feature scaling.
3. Splitting the dataset into training and testing sets.
4. Building and training machine learning models for churn
prediction.
5. Evaluating model performance using metrics like accuracy,
precision, recall, and F1-score.
6. Choosing the model with the best accuracy.
7. Providing recommendations based on model insights.
The ultimate goal is to help the telecom company proactively
identify customers at risk of leaving, allowing them to implement
targeted retention strategies and improve customer satisfaction.
• Importing the necessary libraries for data analysis and visualization,
ensuring that visualizations are displayed inline.
• Reading a CSV file located at the specified path and assigning it to a
pandas DataFrame called ‘telco_churn’ for further analysis.
• This setup is commonly done at the beginning of a data analysis or
machine learning project to set up the environment, load the
dataset, and prepare for exploration and visualization. It is
particularly useful for interactive data analysis.
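The loading step described above can be sketched as follows. The slides do not show the file path, so an in-memory sample stands in for the CSV file here; the three rows are made up for illustration and only a few of the dataset's columns are included.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; the real path is not shown in the slides.
sample_csv = io.StringIO(
    "customerID,gender,tenure,MonthlyCharges,TotalCharges,Churn\n"
    "0001-A,Female,1,29.85,29.85,No\n"
    "0002-B,Male,34,56.95,1889.5,No\n"
    "0003-C,Male,2,53.85,108.15,Yes\n"
)

# pd.read_csv accepts a file path or any file-like object.
telco_churn = pd.read_csv(sample_csv)

print(telco_churn.head())   # first rows of the DataFrame
print(telco_churn.shape)    # (number of rows, number of columns)
```

With the real file, the same call would take the path string in place of `sample_csv`.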
Displaying the “telco_churn” dataset
• The primary goal is to uncover patterns, relationships, anomalies, and
insights that can inform subsequent analysis.
• Looking at the dataset by using head(), tail(), sample(), and the size attribute
• Checking the various attributes of the dataset, such as shape (total number
of rows and columns), column names, datatypes of columns,
dimensionality, info() (memory size, datatypes, non-null counts), and
describe() (min, max, median, 25%, 75%, and so on)
• The describe() method is useful for quickly understanding the
distribution and central tendency of your numerical data.
We can see that TotalCharges
holds numerical values, but its
datatype is shown as object.
• Checking value_counts(), nunique(), duplicated().sum(), isnull().sum()
OBSERVATION - The checks above show
no column-naming issues, but
'No internet service' and 'No phone service'
mean the same as 'No'.
nunique() - Returns a
Series that displays
the count of unique
values in each column.
There are no missing values reported in
the above dataset.
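A minimal sketch of the checks listed above, run on a tiny made-up frame that reuses three column names from the dataset. It also illustrates why a whitespace entry is not reported as missing.

```python
import pandas as pd

# Made-up rows using column names from the dataset, for illustration only.
df = pd.DataFrame({
    "InternetService": ["DSL", "Fiber optic", "No", "DSL"],
    "OnlineSecurity": ["Yes", "No", "No internet service", "No"],
    "TotalCharges": ["29.85", "1889.5", " ", "108.15"],
})

print(df["OnlineSecurity"].value_counts())  # frequency of each category
unique_counts = df.nunique()                # number of distinct values per column
n_duplicates = df.duplicated().sum()        # number of fully duplicated rows
n_missing = df.isnull().sum()               # NaN count per column

# A whitespace string is not NaN, so isnull() does not flag it.
print(n_missing["TotalCharges"])
```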
1. TotalCharges should be float or int, but its datatype is object, so
there might be some missing values in this column. We need to
convert it to float or int.
• Because there are white spaces in the TotalCharges column,
the missing values are not visible.
1. SeniorCitizen is actually a categorical column, hence the
25%-50%-75% distribution is not meaningful.
2. In the MonthlyCharges column, the average monthly charge is USD
64.76, whereas 25% of customers pay more than USD 89.85 per
month.
3. No duplicated values.
1. Creating a copy of telco_churn for manipulation & processing, so
the original data is left unchanged.
2. Churn Column (Target Column)
Converting the Churn column from a categorical value to a numerical value
• Displaying the maximum and minimum values
• Finding the percentage of each class in the Churn column
• Data is highly imbalanced, ratio = 73:27
• So we analyze the data with other features while taking the target values
separately to get some insights.
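The target conversion and the class percentages can be sketched as below. Only the Churn column is rebuilt, using the class counts reported in the slides (No: 5174, Yes: 1869); the 0/1 mapping is an assumption about how the conversion was done.

```python
import pandas as pd

# Rebuild only the target column from the counts reported in the slides.
telco_copy = pd.DataFrame({"Churn": ["No"] * 5174 + ["Yes"] * 1869})

# Map the categorical target to 0/1 (assumed mapping: No -> 0, Yes -> 1).
telco_copy["Churn"] = telco_copy["Churn"].map({"No": 0, "Yes": 1})

# Share of each class, in percent.
churn_pct = telco_copy["Churn"].value_counts(normalize=True) * 100
print(churn_pct.round(2))
```

Rounded to whole numbers, the two shares give the 73:27 ratio quoted above.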
3. TotalCharges Column
Total Charges should be a numeric amount. Converting it to a numerical
data type.
• top: " " (the most frequent value in the "TotalCharges" column is
white space)
• freq: 11 (the count of " " occurrences in the "TotalCharges" column)
Here we will be replacing the white spaces with NaN values.
Calculating the percentage of NaN values with respect to the total number
of rows.
As we can see, there are 11 missing
values in the TotalCharges column.
Let's check these records.
OBSERVATION - Since the share of these records in the total dataset is very low, i.e.
0.16%, it is safe to fill them with 0 for further processing.
Missing Value Treatment
Checking the data type of the 'TotalCharges' column
OBSERVATION – The missing values are
now filled with 0. There are no missing
values left.
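The TotalCharges treatment can be sketched as follows, on made-up rows. The slides replace the white spaces with NaN first; this sketch instead uses `pd.to_numeric` with `errors="coerce"`, which turns unparseable entries into NaN during the conversion, so it is one possible route rather than the exact code from the slides.

```python
import pandas as pd

# Made-up rows; one blank entry mimics the whitespace values in the dataset.
telco_copy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", "108.15"]})

# Convert to numbers; entries that cannot be parsed (such as " ") become NaN.
telco_copy["TotalCharges"] = pd.to_numeric(telco_copy["TotalCharges"], errors="coerce")

# Percentage of missing values relative to the number of rows.
missing_pct = telco_copy["TotalCharges"].isnull().mean() * 100
print(f"{missing_pct:.2f}% missing in this sample")

# Fill the gaps with 0, as described above.
telco_copy["TotalCharges"] = telco_copy["TotalCharges"].fillna(0)

# The figure reported in the slides: 11 missing rows out of 7043.
reported_pct = 11 / 7043 * 100
print(round(reported_pct, 2))
```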
4. Tenure Column
Dividing customers into bins based on tenure, e.g. for tenure up to 12
months, assign a tenure group of 1-12; for tenure between 1 and 2 years,
a tenure group of 13-24; and so on (i.e. grouping the tenure in bins of 12
months).
Dropping the tenure column as we
already created a tenure_group.
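A minimal sketch of the 12-month binning, assuming `pd.cut` is used and that tenure runs up to 72 months; the tenure values below are made up.

```python
import pandas as pd

# Made-up tenure values in months.
telco_copy = pd.DataFrame({"tenure": [1, 12, 13, 24, 30, 60, 72]})

# Bin edges every 12 months; pd.cut uses right-closed bins (lower, upper] by default.
edges = list(range(0, 73, 12))                          # [0, 12, 24, ..., 72]
labels = [f"{lo + 1}-{lo + 12}" for lo in edges[:-1]]   # "1-12", "13-24", ...

telco_copy["tenure_group"] = pd.cut(telco_copy["tenure"], bins=edges, labels=labels)

# The raw column is no longer needed once the group exists.
telco_copy = telco_copy.drop(columns="tenure")
print(telco_copy["tenure_group"].tolist())
```

With these edges a tenure of 0 falls outside the first bin and becomes NaN; passing `include_lowest=True` to `pd.cut` is one way to keep such rows in the first group.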
5. Customer-ID Column
6. Modifying Columns
'No internet service' and 'No phone service' are not different from 'No'
and can be replaced with "No".
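The replacement described above can be sketched with `DataFrame.replace`; the two columns and their rows are made up for illustration.

```python
import pandas as pd

# Made-up rows with two of the service columns.
telco_copy = pd.DataFrame({
    "OnlineSecurity": ["Yes", "No internet service", "No"],
    "MultipleLines": ["No phone service", "Yes", "No"],
})

# Collapse both longer labels into a plain "No" across all columns.
telco_copy = telco_copy.replace({
    "No internet service": "No",
    "No phone service": "No",
})
print(telco_copy)
```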
Data visualization is the representation of data in graphical or visual
formats to communicate information effectively. It involves using charts,
graphs, maps, and other visual elements to convey patterns, trends, and
insights present in the data. It is a powerful tool for exploring,
interpreting, and presenting data in a way that is easily understandable.
Types of Data Visualization:
1. Univariate Analysis: Univariate analysis involves the examination of a
single variable or feature in isolation.
2. Bivariate Analysis: Bivariate analysis helps uncover patterns,
correlations, and dependencies between two variables.
OBSERVATION - Customers with Fiber optic internet service
have churned more. DSL is the
most popular internet service type.
OBSERVATION - Most customers have not churned
(No: 5174); fewer customers have churned
(Yes: 1869).
OBSERVATION - Electronic check accounts for 33.58%,
more than any other payment method.
OBSERVATION - Very few outliers in MonthlyCharges.
OBSERVATION - The distribution appears to be right-skewed, with a
longer tail on the right side. This indicates that there are fewer
senior citizens in the dataset.
OBSERVATION - Customers in the 1-12
tenure_group have
churned more.
OBSERVATION - Male customers make up 50.48%
and female customers 49.52%.
OBSERVATION - Female customers in the tenure_group
within 12 months (i.e. 1 year) have
churned the most.
OBSERVATION – The 'Month-to-month' contract has a
significantly higher bar, which suggests a higher churn rate,
mostly among female customers. With no
contract terms, these customers are free to leave.
OBSERVATION - Surprising insight: churn is higher at
lower Total Charges.
OBSERVATION - Total Charges increase as Monthly Charges increase.
OBSERVATION - Churn is high when Monthly Charges are high.
• Female, non-senior customers in the tenure_group within 12 months
(i.e. 1 year) have churned the most.
• The 'Month-to-month' contract has a higher churn rate, mostly among
female customers. With no contract terms, these customers are free
to leave.
• Churn is high when Monthly Charges are high and Total Charges are low,
even though Total Charges increase as Monthly Charges increase.
• Fewer customers have churned (Yes: 1869). The data is therefore
highly imbalanced, with a ratio of 73:27.
• Electronic check is the most common payment method (33.58%), and
customers using it churn more.
• The gender distribution is roughly balanced.
• Customers with Fiber optic internet service have churned more. DSL
is the most popular internet service type.
• PhoneService and paperless billing are chosen by a significant number
of customers; among them, fewer have churned than have not churned.
1. Creating Binary Features: Converting categorical features like 'Partner',
'Dependents' into binary features (0 or 1).
2. Creating a Feature for Family Size: Combining information from
'Partner' and 'Dependents' to create a feature representing the size of the
customer's family.
3. Creating a plot: To see which family size has churned more.
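The three feature-engineering steps above can be sketched as below. The slides do not state how family size is computed, so the formula here (the customer, plus one for a partner, plus one for dependents) is an assumption, as is the `family_size` column name; the rows are made up.

```python
import pandas as pd

# Made-up rows; Churn is already encoded as 0/1.
telco_copy = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes", "No"],
    "Dependents": ["No", "No", "Yes", "Yes"],
    "Churn": [0, 1, 0, 1],
})

# Step 1: binary features (No -> 0, Yes -> 1).
for col in ["Partner", "Dependents"]:
    telco_copy[col] = telco_copy[col].map({"No": 0, "Yes": 1})

# Step 2: assumed formula, since Dependents is a yes/no flag and not a head count.
telco_copy["family_size"] = 1 + telco_copy["Partner"] + telco_copy["Dependents"]

# Step 3: churn rate per family size, ready to be plotted as a bar chart.
churn_by_family = telco_copy.groupby("family_size")["Churn"].mean()
print(churn_by_family)
```

Calling `churn_by_family.plot(kind="bar")` would draw the comparison when matplotlib is installed.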
The goal of data preprocessing is to enhance the quality of the data,
remove any inconsistencies or errors, and prepare it for further analysis
or modeling.
Two Techniques of Feature Encoding are:
1. One-Hot Encoding - One-hot encoding is a method used to convert
categorical variables into a binary matrix (0s and 1s).
2. Label Encoding - Label encoding is another technique for
converting categorical data into a numerical format.
1. One-Hot Encoding
2. Label Encoding
Data Displayed
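A minimal sketch of the two encoding techniques named above, using `pd.get_dummies` for one-hot encoding and scikit-learn's `LabelEncoder` for label encoding. Which columns the slides encode with which technique is not shown, so the column choice here is an assumption and the rows are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; the column choice for each technique is an assumption.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "tenure_group": ["1-12", "13-24", "61-72", "1-12"],
})

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["Contract"], prefix="Contract", dtype=int)

# Label encoding: each category becomes an integer code,
# assigned in the sorted order of the labels.
encoder = LabelEncoder()
df["tenure_group_encoded"] = encoder.fit_transform(df["tenure_group"])

print(one_hot)
print(df)
```

One-hot encoding avoids implying an order between categories, while label encoding keeps a single column at the cost of introducing one.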
4. Correlation of the features with 'Churn'
The 'Month-to-Month Contract' feature has the greatest influence among all features.
5. Using a heatmap: correlation of the features with 'Churn'.
• HIGH churn is seen in the case of month-to-month contracts.
• LOW churn is seen in the case of long-term contracts.
• Factors like gender, availability of PhoneService and number of multiple lines have
almost NO impact on churn.
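Ranking features by their correlation with 'Churn' can be sketched as below. The six rows are made up and were chosen only so that the ordering mirrors the findings above; they are not values from the dataset.

```python
import pandas as pd

# Made-up, already-encoded rows; not values from the real dataset.
df = pd.DataFrame({
    "Contract_Month-to-month": [1, 1, 0, 0, 1, 0],
    "Contract_Two year":       [0, 0, 1, 1, 0, 0],
    "gender":                  [1, 0, 1, 0, 0, 0],
    "Churn":                   [1, 1, 0, 0, 1, 0],
})

# Pearson correlation of every feature with the target, strongest positive first.
churn_corr = df.corr()["Churn"].drop("Churn").sort_values(ascending=False)
print(churn_corr)
```

The full matrix from `df.corr()` is what a heatmap would display, for example with seaborn's `heatmap` function if that library is available.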
This code randomly splits the dataset X (features) and y
(labels) into two separate sets: the training set (X_train and y_train) and the
testing set (X_test and y_test). The split is done with a test size of 0.2,
meaning that 20% of the data will be allocated for testing, while the
remaining 80% will be used for training. The random_state parameter is set
to ensure reproducibility of the split.
1. Splitting the telco_copy into X and y and then doing Train-Test Split.
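The split described above can be sketched with scikit-learn's `train_test_split`. The frame is filled with random made-up values, and `random_state=42` is an assumed seed since the slides do not state the value used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random made-up data standing in for the processed telco_copy frame.
rng = np.random.default_rng(0)
telco_copy = pd.DataFrame({
    "MonthlyCharges": rng.uniform(20, 120, size=100),
    "tenure_group": rng.integers(0, 6, size=100),
    "Churn": rng.integers(0, 2, size=100),
})

# Features and target.
X = telco_copy.drop(columns="Churn")
y = telco_copy["Churn"]

# 80% training, 20% testing; the seed value is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Passing `stratify=y` is an option when the class ratio should be preserved in both sets.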
Scaling is performed to ensure that all numerical features in a
dataset are on a similar scale, so that features with large ranges do not
dominate, comparisons are fair, and training converges more easily. It is a
technique used in machine learning to standardize or normalize the range of
independent variables or features of the dataset.
Methods of feature scaling
1. Standardization (Z-score Normalization): This code is an
implementation of the standardization (Z-score normalization) method
for feature scaling. Standardization scales the features so that they
have a mean of 0 and a standard deviation of 1.
1. Standard Scaling Analysis
• Scaling the numerical features
• Extracting numerical features for scaling
2. Fitting and transforming the training data, saving the scaling
parameters for future use in test data.
• Display the scaled training and test sets
1. Before Scaling on Numerical_features
2. After Scaling on Numerical_Features
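The scaling steps above can be sketched with scikit-learn's `StandardScaler`. The values are made up, and treating MonthlyCharges and TotalCharges as the numerical features is an assumption. The scaler is fitted on the training set only and then reused on the test set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows.
X_train = pd.DataFrame({
    "MonthlyCharges": [20.0, 60.0, 100.0],
    "TotalCharges": [100.0, 2000.0, 6000.0],
})
X_test = pd.DataFrame({"MonthlyCharges": [80.0], "TotalCharges": [3000.0]})

# Assumed list of numerical columns to scale.
numerical_features = ["MonthlyCharges", "TotalCharges"]

scaler = StandardScaler()
# Learn mean and standard deviation from the training data, then scale it.
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
# Reuse the training parameters on the test data; no refitting.
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

print(X_train)
print(X_test)
```

Fitting on the training set alone keeps information about the test set out of the scaling parameters.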
• In a binary classification problem, one class may have significantly
fewer instances than the other. SMOTEENN addresses such imbalanced
datasets by generating synthetic examples for the minority class
(SMOTE) and then cleaning the dataset to remove noisy samples (ENN,
Edited Nearest Neighbours), leading to a more balanced and
representative dataset for model training.
Random Forest
XGBoost Classifier
K-Nearest Neighbors Classifier (KNN)
Decision Tree
Support Vector Classifier
• On imbalanced data, accuracy is misleading.
• As you can see, the accuracy is quite low, and the dataset is
imbalanced. Hence, we need to check recall, precision &
f1 score for the minority class, and it is evident that the
precision, recall & f1 score are too low for Class 1, i.e. churned
customers. Hence, moving ahead to apply SMOTEENN
(oversampling + ENN).
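Why accuracy alone is misleading on this data can be illustrated with a baseline that always predicts "no churn". The labels are rebuilt from the class counts reported in the slides (5174 and 1869) for the full dataset, not from the actual test split, and the baseline is a hypothetical model used only for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# True labels rebuilt from the reported class counts (0 = not churned, 1 = churned).
y_true = np.array([0] * 5174 + [1] * 1869)

# Hypothetical baseline: always predict the majority class.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
f1 = f1_score(y_true, y_pred, pos_label=1, zero_division=0)

print(f"accuracy={acc:.4f} recall={rec:.4f} f1={f1:.4f}")
```

The baseline reaches about 73% accuracy without identifying a single churned customer, which is why recall, precision and f1 score for Class 1 are checked as well.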
• After using SMOTEENN
• After evaluating different models for churn detection, including Decision Tree, Random
Forest, K-Nearest Neighbors, Naive Bayes, XGBoost and SVC, it can be concluded that
the XGBoost model achieved the highest accuracy among the evaluated models, with
an accuracy score of 0.9689. XGBoost is an ensemble learning method that
combines the predictions of multiple weak learners (typically decision trees) to create a
strong learner. This helps capture complex relationships in the data.
• Its key strengths are its ability to handle complex relationships in data, reduce
overfitting, handle missing values, and provide flexibility and customization for various
machine learning tasks.
• Combining XGBoost with SMOTEENN may enhance the model's performance on
imbalanced datasets. It helps the model better capture patterns in the minority class by
oversampling and cleaning the dataset.
The best model is the XGBoost Classifier, with the highest
accuracy score of 0.9689.
• Finding the names of the models with the maximum and minimum
accuracy scores.
1. TotalCharges increase as MonthlyCharges increase.
2. Customers with a 'Month-to-month' contract have a higher churn
rate. With no contract terms, they are free to leave.
3. Churn is high when Monthly Charges are high and Total
Charges are low.
4. Electronic check is the most common payment method, and
customers using it churn more.
5. Customers with Fiber optic internet service have churned
more. DSL is the most popular internet service type.
6. PhoneService and paperless billing are chosen by a significant
number of customers, and very few of them have churned.
7. The XGBoost model achieved the highest accuracy among the
evaluated models.

Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul

Último (20)

Analysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performanceAnalysis insight about a Flyball dog competition team's performance
Analysis insight about a Flyball dog competition team's performance
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens""Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...
Global Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headedGlobal Situational Awareness of A.I. and where its headed
Global Situational Awareness of A.I. and where its headed
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......

Decoding Patterns: Customer Churn Prediction Data Analysis Project

  • 1.
  • 2. CAPSTONE PROJECT TITLE: Customer Churn Analysis. Presented by: PALLAVI MOHANTY
  • 3. PROJECT CONTENT: I. Introduction and Problem Statement; II. Data Loading; III. Data Exploring; IV. Data Cleaning; IV.1. Binning; V. Data Visualization; V.1. Univariate Analysis; V.2. Bivariate Analysis; VI. Feature Engineering; VII. Data Preprocessing; VIII. Train – Test Split; IX. Feature Scaling; X. SMOTEENN; XI. Model Building and Evaluation; XII. Model Comparison. CUSTOMER CHURN
  • 4. I. INTRODUCTION Q. What is Customer Churn? • Customer churn occurs when customers or subscribers stop doing business with a firm or service. • Each row represents a customer; each column contains a customer attribute described in the column metadata. The data set includes information about: • Customers who left within the last month – the column is called Churn. • Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies. • Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges. • Demographic information about customers – Customer ID, gender, and whether they have partners and dependents. THIS IS A CLASSIC TELECOM CHURN USE CASE.
  • 5. PROBLEM STATEMENT The Telco Churn dataset is typically used for predicting customer churn. The target variable has only two possible outcomes: churn or not churn (binary classification). "Churn" refers to customers who are likely to cancel their contracts soon. In the telecom industry, customer churn can be a significant issue, as it can lead to revenue loss. If the company can predict churn, it can reach out to at-risk users before they leave.
  • 6. APPROACH TO SOLVE PROBLEM STATEMENT 1. Exploratory Data Analysis (EDA) to understand data patterns and relationships. 2. Data preprocessing, including handling missing values, encoding categorical variables, and feature scaling. 3. Splitting the dataset into training and testing sets. 4. Building and training machine learning models for churn prediction. 5. Evaluating model performance using metrics like accuracy, precision, recall, and F1-score. 6. Choosing the model with the best accuracy. 7. Providing recommendations based on model insights. The ultimate goal is to help the telecom company proactively identify customers at risk of leaving, allowing it to implement targeted retention strategies and improve customer satisfaction.
  • 7. II. DATA LOADING • Importing the necessary libraries for data analysis and visualization, and ensuring that visualizations are displayed inline. • Reading a CSV file located at the specified path and assigning it to a pandas DataFrame called ‘telco_churn’ for further analysis. • This step is commonly used at the beginning of a data analysis or machine learning project to set up the environment, load the dataset, and prepare for exploration and visualization. It is particularly useful for interactive data analysis.
  • 8. Displaying the “telco_churn” dataset
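
The loading step described on slides 7 and 8 might look roughly like the sketch below. The real file path is not given in the slides, so a small in-memory CSV with made-up rows stands in for it; only the DataFrame name `telco_churn` comes from the deck.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the Telco CSV file;
# the real file path is not given in the slides.
csv_text = """customerID,gender,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,Churn
0001-AAAA,Female,0,1,29.85,29.85,No
0002-BBBB,Male,0,34,56.95,1889.5,No
0003-CCCC,Male,1,2,53.85,108.15,Yes
"""

# With a file on disk, the path would be passed to pd.read_csv instead.
telco_churn = pd.read_csv(io.StringIO(csv_text))

print(telco_churn.head())   # first rows of the dataset
print(telco_churn.shape)    # (number of rows, number of columns)
```
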
  • 9. III. DATA EXPLORING • The primary goal is to uncover patterns, relationships, anomalies, and insights that can inform subsequent analysis. • Looking at the dataset by using head(), tail(), sample(), and the size attribute.
  • 10. • Checking the various attributes of the dataset, such as shape (total number of rows and columns), column names, datatypes of columns, dimensionality, info (memory size, datatypes, NaN values), and describe (min, max, median, 25%, 75%, and so on). • The describe() method is useful for quickly understanding the distribution and central tendency of numerical data. We can see that TotalCharges holds numerical values but its datatype is shown as object.
  • 11. • Checking value_counts(), nunique(), duplicated().sum(), isnull().sum() OBSERVATION - The output above shows no issues with column names, but 'No internet service' and 'No phone service' mean the same as 'No'. nunique() - Returns a Series that displays the count of unique values in each column. OBSERVATION - isnull().sum() reports no missing values in the dataset.
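
The exploration calls named on slides 9 to 11 can be tried on a toy frame. The rows below are invented for illustration; the point is that `isnull().sum()` reports nothing missing even though one `TotalCharges` entry is a blank string.

```python
import pandas as pd

# Invented rows; TotalCharges is stored as text and one entry is a blank string.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "InternetService": ["DSL", "Fiber optic", "No", "DSL"],
    "MonthlyCharges": [29.85, 70.70, 20.05, 56.95],
    "TotalCharges": ["29.85", "151.65", " ", "1889.5"],
})

print(df.shape)                               # rows and columns
print(df.dtypes)                              # TotalCharges is not numeric
print(df.describe())                          # summarises numeric columns only
print(df["InternetService"].value_counts())   # frequency of each category

n_unique = df.nunique()             # unique values per column
n_dupes = df.duplicated().sum()     # fully duplicated rows
n_missing = df.isnull().sum()       # NaN count per column; blanks are not counted
print(n_unique, n_dupes, n_missing, sep="\n")
```
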
  • 12. OBSERVATION 1. TotalCharges should be float or int but is stored as object, so there might be some missing values in this column; it needs to be converted to float or int. • Because the TotalCharges column contains white spaces, the missing values are not visible. 2. SeniorCitizen is actually categorical, hence its 25%-50%-75% distribution is not meaningful. 3. In the MonthlyCharges column, the average monthly charge is USD 64.76, whereas 25% of customers pay more than USD 89.85 per month (the 75th percentile). 4. No duplicated values.
  • 13. IV. DATA CLEANING 1. Creating a copy of telco_churn for manipulation and processing, so that the original DataFrame stays unchanged. 2. Churn Column (Target Column): converting the Churn column from a categorical value to a numerical value.
  • 14. • Displaying the maximum and minimum values • Finding the percentage of each class in the Churn column OBSERVATION - • Data is highly imbalanced, ratio = 73:27. • So we analyze the data with other features while taking the target values separately to get some insights.
  • 15. 3. TotalCharges Column: Total charges should be a numeric amount, so the column is converted to a numerical data type. OBSERVATION - • top: " " (the most frequent value in the "TotalCharges" column is white space) • freq: 11 (the count of " " occurrences in the "TotalCharges" column)
  • 16. Here we fill the white spaces with NaN values and calculate the percentage of NaN values with respect to the total number of rows. As we can see, there are 11 missing values in the TotalCharges column. Let's check these records. OBSERVATION - Since the share of these records in the total dataset is very low, i.e. 0.16%, it is safe to fill them with 0 for further processing.
  • 17. Missing Value Treatment: checking the data type of the 'TotalCharges' column. OBSERVATION – The missing values are now filled with 0. There are no missing values left.
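
A minimal sketch of the TotalCharges treatment on slides 15 to 17, using invented rows. The slides first replace white space with NaN and then convert; this sketch swaps in `pd.to_numeric(..., errors="coerce")`, which reaches the same state in one step. For the full dataset the slides report 11 such rows, and 11 out of 7043 customers (5174 + 1869) is about 0.16%.

```python
import pandas as pd

# Invented rows; the blank string mimics the white-space entries in the real column.
df = pd.DataFrame({
    "tenure": [0, 1, 34],
    "TotalCharges": [" ", "29.85", "1889.5"],
})

# errors="coerce" turns anything unparseable (here the blank string) into NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

n_missing = int(df["TotalCharges"].isnull().sum())
pct_missing = 100 * n_missing / len(df)
print(f"{n_missing} missing ({pct_missing:.2f}% of rows)")

# The slides fill the missing values with 0.
df["TotalCharges"] = df["TotalCharges"].fillna(0)
print(df.dtypes)
```
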
  • 18. IV.1. BINNING 4. Tenure Column: dividing customers into bins based on tenure, e.g. for a tenure of up to 12 months, assign the tenure group 1-12; for a tenure between 1 and 2 years, the tenure group 13-24; and so on (i.e. grouping the tenure in bins of 12 months). Dropping the tenure column, as tenure_group has already been created.
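
One possible way to build the 12-month tenure groups is `pd.cut`. The upper limit of 72 months and the exact label strings are assumptions here, not taken from the slides.

```python
import pandas as pd

df = pd.DataFrame({"tenure": [1, 12, 13, 24, 45, 72]})

# Bin edges every 12 months; 72 as the upper limit is an assumption about the data.
edges = list(range(0, 73, 12))                          # [0, 12, 24, ..., 72]
labels = [f"{lo + 1}-{lo + 12}" for lo in edges[:-1]]   # ["1-12", "13-24", ...]

# Intervals are right-closed: (0, 12], (12, 24], ... so tenure 12 lands in "1-12".
df["tenure_group"] = pd.cut(df["tenure"], bins=edges, labels=labels)

# The raw column is dropped once the group exists, as on the slide.
df = df.drop(columns="tenure")
print(df)
```

With these edges a tenure of 0 falls outside the first interval and becomes NaN; passing `include_lowest=True` to `pd.cut` would place it in the first group instead.
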
  • 19. 5. Customer-ID Column 6. Modifying Columns: 'No internet service' and 'No phone service' are not different from 'No' and can be replaced with 'No'.
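
A sketch of the two steps on slide 19, with invented rows. The slide does not say what is done with the customer ID column; dropping it is an assumption, made here because an identifier carries no predictive signal.

```python
import pandas as pd

df = pd.DataFrame({
    "customerID": ["0001-AAAA", "0002-BBBB", "0003-CCCC"],
    "OnlineSecurity": ["Yes", "No internet service", "No"],
    "MultipleLines": ["No phone service", "Yes", "No"],
})

# Assumption: the identifier column is dropped.
df = df.drop(columns="customerID")

# Collapse the two redundant categories into a plain "No" across all columns.
df = df.replace({"No internet service": "No", "No phone service": "No"})
print(df)
```
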
  • 20. V. DATA VISUALIZATION Data visualization is the representation of data in graphical or visual formats to communicate information effectively. It involves using charts, graphs, maps, and other visual elements to convey patterns, trends, and insights present in the data. It is a powerful tool for exploring, interpreting, and presenting data in a way that is easily understandable. Types of Data Visualization: 1. Univariate Analysis: Univariate analysis involves the examination of a single variable or feature in isolation. 2. Bivariate Analysis: Bivariate analysis helps uncover patterns, correlations, and dependencies between two variables.
  • 21. V.1. UNIVARIATE ANALYSIS (plots 1-4) OBSERVATION - Customers with the Fiber optic internet service type have churned more. DSL is the most popular internet service type. OBSERVATION - Most customers have not churned (No: 5174), and fewer customers have churned (Yes: 1869). OBSERVATION - Electronic check accounts for 33.58%, more than any other payment method. OBSERVATION - Very few outliers in MonthlyCharges.
  • 22. (plots 5-7) OBSERVATION - The distribution appears to be right-skewed, with a longer tail on the right side. This indicates that there are fewer senior citizens in the dataset. OBSERVATION – Customers in the 1-12 tenure_group have churned more. OBSERVATION - Males make up 50.48% and females 49.52%.
  • 23. V.2. BIVARIATE ANALYSIS 1. OBSERVATION - Female customers in the tenure_group within 12 months (i.e. 1 year) have churned the most. 2. OBSERVATION – The 'Month-to-month' contract has a significantly higher bar, which suggests a higher churn rate, mostly among female customers. With no contract term, they are free to leave.
  • 24. 3. OBSERVATION - Surprising insight: churn is higher at lower Total Charges. 4. OBSERVATION - Total Charges increase as Monthly Charges increase, as expected. 5. OBSERVATION - Churn is high when Monthly Charges are high.
  • 25. CONCLUSION FOR DATA VISUALIZATION • Female, non-senior customers in the tenure_group within 12 months (i.e. 1 year) have churned the most. • The 'Month-to-month' contract has a higher churn rate, mostly among female customers. With no contract term, they are free to leave. • Churn is high when Monthly Charges are high and Total Charges are low, even though Total Charges increase as Monthly Charges increase. • Fewer customers have churned (Yes: 1869). The data is therefore highly imbalanced, ratio = 73:27. • Electronic check, at 33.58%, is the most common payment method and is associated with more churning customers. • The gender distribution is roughly balanced. • Customers with the Fiber optic internet service type have churned more. DSL is the most popular internet service type. • PhoneService and PaperlessBilling are chosen by a significant number of customers; among them, fewer have churned than not churned.
  • 26. VI. FEATURE ENGINEERING 1. Creating Binary Features: Converting categorical features like 'Partner' and 'Dependents' into binary features (0 or 1). 2. Creating a Feature for Family Size: Combining information from 'Partner' and 'Dependents' to create a feature representing the size of the customer's family.
  • 27. 3. Creating a plot: to see which family size has churned more.
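
The feature-engineering steps on slides 26 and 27 could be sketched as follows. The slides do not give the family-size formula; counting the customer plus one for a partner plus one for dependents is an assumption, as is the column name `FamilySize`. The rows are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Partner": ["Yes", "No", "Yes", "No"],
    "Dependents": ["Yes", "No", "No", "No"],
    "Churn": [0, 1, 0, 1],
})

# Binary features: Yes -> 1, No -> 0.
for col in ["Partner", "Dependents"]:
    df[col] = df[col].map({"Yes": 1, "No": 0})

# Assumed formula: the customer (1) plus a partner flag plus a dependents flag.
df["FamilySize"] = 1 + df["Partner"] + df["Dependents"]

# Churn rate per family size; this is the table a bar plot would be drawn from.
churn_by_family = df.groupby("FamilySize")["Churn"].mean()
print(churn_by_family)
```
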
  • 28. VII. DATA PREPROCESSING The goal of data preprocessing is to enhance the quality of the data, remove any inconsistencies or errors, and prepare it for further analysis or modeling. FEATURE ENCODING: the two techniques of feature encoding are: 1. One-Hot Encoding - One-hot encoding converts a categorical variable into a binary matrix (0s and 1s), with one column per category. 2. Label Encoding - Label encoding converts categorical data into a numerical format by assigning each category an integer.
  • 29. 1. One-Hot Encoding 2. Label Encoding
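
Both encodings can be illustrated on a toy frame. Which columns the deck one-hot encodes and which it label encodes is not visible in the transcript, so the choice of `Contract` and `gender` below is only an assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "gender": ["Female", "Male", "Male", "Female"],
})

# One-hot encoding: one 0/1 column per Contract category.
encoded = pd.get_dummies(df, columns=["Contract"], dtype=int)

# Label encoding: each category becomes an integer (classes are sorted alphabetically).
label_encoder = LabelEncoder()
encoded["gender"] = label_encoder.fit_transform(encoded["gender"])

print(encoded)
print(list(label_encoder.classes_))   # position in this list is the integer code
```
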
  • 31. 4. Correlation of the features with 'Churn' IDENTIFYING BEST FEATURE: The 'Month-to-month contract' feature has the greatest influence among all features.
  • 32. MULTIVARIATE ANALYSIS 5. Using a heatmap to show the correlation of the features with 'Churn'. OBSERVATION - • HIGH churn is seen for month-to-month contracts. • LOW churn is seen for long-term contracts. • Factors like gender, availability of PhoneService and number of multiple lines have almost NO impact on churn.
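
Ranking features by their correlation with `Churn` might be done as in this sketch. The eight rows are invented so that the month-to-month column correlates positively, the two-year column negatively, and gender not at all; they are not the deck's results.

```python
import pandas as pd

# Invented, already-encoded rows for illustration only.
encoded = pd.DataFrame({
    "Churn":                   [1, 1, 1, 1, 0, 0, 0, 0],
    "Contract_Month-to-month": [1, 1, 1, 1, 0, 0, 0, 1],
    "Contract_Two year":       [0, 0, 0, 0, 1, 1, 0, 0],
    "gender":                  [0, 1, 0, 1, 0, 1, 0, 1],
})

# Pearson correlation of every feature with the target, strongest positive first.
churn_corr = (
    encoded.corr()["Churn"]
    .drop("Churn")
    .sort_values(ascending=False)
)
print(churn_corr)
```

The full matrix returned by `encoded.corr()` is what a heatmap would display.
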
  • 33. VIII. TRAIN – TEST SPLIT 1. Splitting telco_copy into X and y and then doing the train-test split. This code randomly splits the dataset X (features) and y (labels) into two separate sets: the training set (X_train and y_train) and the testing set (X_test and y_test). The split is done with a test size of 0.2, meaning that 20% of the data is allocated for testing, while the remaining 80% is used for training. The random_state parameter is set to ensure reproducibility of the split.
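
The split described on slide 33 corresponds to scikit-learn's `train_test_split`. The random arrays below stand in for the real X and y, and the value 42 for `random_state` is an assumption; the slide only says that the parameter is set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 customers, 3 features, roughly 27% churners.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (rng.random(100) < 0.27).astype(int)

# 80% training, 20% testing; a fixed random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Passing `stratify=y` would additionally keep the churn ratio the same in both sets; the slides do not say whether this was done.
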
  • 34. IX. FEATURE SCALING Scaling is performed to ensure that all numerical features in a dataset are on a similar scale, avoiding biases, enabling fair comparisons, and helping learning algorithms converge. It is a technique used in machine learning to standardize or normalize the range of the independent variables or features of the dataset. Methods of feature scaling: 1. Standardization (Z-score Normalization): This code is an implementation of the standardization (Z-score normalization) method for feature scaling. Standardization scales the features so that they have a mean of 0 and a standard deviation of 1.
  • 35. 1. Standard Scaling Analysis • Extracting the numerical features for scaling • Scaling the numerical features 2. Fitting and transforming the training data, and saving the scaling parameters for later use on the test data. • Displaying the scaled training and test sets
  • 36. 1. Before Scaling on Numerical_features 2. After Scaling on Numerical_Features
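
Standardization replaces each value x with z = (x - mean) / standard deviation, where both statistics are computed on the training data only. A minimal sketch with scikit-learn's `StandardScaler`, using invented MonthlyCharges and TotalCharges values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented numerical features: [MonthlyCharges, TotalCharges].
X_train = np.array([
    [29.85, 29.85],
    [56.95, 1889.50],
    [53.85, 108.15],
    [70.70, 151.65],
])
X_test = np.array([[42.30, 1840.75]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data ...
X_train_scaled = scaler.fit_transform(X_train)
# ... and transform reuses those same parameters on the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))   # approximately 0 per column
print(X_train_scaled.std(axis=0))    # approximately 1 per column
```

Fitting on the training set alone keeps information from the test set out of the scaling parameters.
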
  • 37. X. SMOTEENN • In a binary classification problem, one class may have significantly fewer instances than the other. SMOTEENN addresses such imbalanced datasets by generating synthetic examples for the minority class (SMOTE) and then cleaning the dataset to remove noise (ENN), ultimately leading to a more balanced and representative dataset for model training.
  • 38. XI. MODEL BUILDING & EVALUATION: Random Forest, XGBoost Classifier, K-Nearest Neighbors Classifier (KNN), Decision Tree, Support Vector Classifier (SVC)
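
To show what the two stages do, here is a hand-rolled simplification, not the implementation the deck uses. It assumes two numeric features, binary labels, and a majority-vote ENN rule; the helper names `smote` and `enn_keep_mask` are made up for this sketch. The packaged version is the `SMOTEENN` class in `imblearn.combine` from imbalanced-learn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating from a random
    minority point towards one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbours because each point is returned as its own nearest neighbour.
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)
    neighbour = idx[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))   # position along the line segment, in [0, 1)
    return X_min[base] + gap * (X_min[neighbour] - X_min[base])


def enn_keep_mask(X, y, k=3):
    """Edited nearest neighbours (simplified): keep a sample only if its label
    matches the majority label of its k nearest neighbours."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbour_labels = y[idx[:, 1:]]   # column 0 is the sample itself
    majority = (neighbour_labels.sum(axis=1) * 2 > k).astype(int)
    return majority == y


# Invented data with the deck's 73:27 class ratio.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(73, 2)),
               rng.normal(2.5, 1.0, size=(27, 2))])
y = np.array([0] * 73 + [1] * 27)

# Stage 1 (SMOTE): oversample the minority class up to the majority count.
X_new = smote(X[y == 1], n_new=73 - 27)
X_bal = np.vstack([X, X_new])
y_bal = np.concatenate([y, np.ones(len(X_new), dtype=int)])

# Stage 2 (ENN): drop samples that disagree with their neighbourhood.
keep = enn_keep_mask(X_bal, y_bal)
X_res, y_res = X_bal[keep], y_bal[keep]

print("before:", np.bincount(y), "after:", np.bincount(y_res))
```

Resampling of this kind is normally applied to the training data only, so that the test set still reflects the original class ratio.
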
  • 39. • On imbalanced data, accuracy is misleading. • As the results show, the accuracy is quite low, and because the dataset is imbalanced we need to check recall, precision and F1 score for the minority class. The precision, recall and F1 score are clearly too low for Class 1, i.e. churned customers. Hence, we move ahead and apply SMOTEENN (oversampling + ENN). • After using SMOTEENN
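
A small worked example of why accuracy alone can mislead on a 73:27 split. The predictions are invented: a model that finds only 5 of 27 churners still scores 78% accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# 73 non-churners (0) and 27 churners (1), matching the deck's 73:27 ratio.
y_true = [0] * 73 + [1] * 27
# Invented predictions: every non-churner is right, but only 5 churners are found.
y_pred = [0] * 73 + [1] * 5 + [0] * 22

accuracy = accuracy_score(y_true, y_pred)     # (73 + 5) / 100
precision = precision_score(y_true, y_pred)   # 5 / 5
recall = recall_score(y_true, y_pred)         # 5 / 27
f1 = f1_score(y_true, y_pred)                 # 2 * 5 / (2 * 5 + 0 + 22)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

`sklearn.metrics.classification_report` prints the same per-class figures in one table.
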
  • 41. CONCLUSION OF MODEL COMPARISON • After evaluating different models for churn detection, including Decision Tree, Random Forest, K-Nearest Neighbors, Naïve Bayes, XGBoost and SVC, it can be concluded that the XGBoost model achieved the highest accuracy among the evaluated models, with an accuracy score of 0.9689. XGBoost is an ensemble learning method that combines the predictions of multiple weak learners (typically decision trees) to create a strong learner. This helps capture complex relationships in the data. • Its key strengths are its ability to handle complex relationships in data, prevent overfitting, handle missing values, and provide flexibility and customization for various machine learning tasks. • Combining XGBoost with SMOTEENN may enhance the model's performance on imbalanced datasets. It helps the model better capture patterns in the minority class by oversampling and cleaning the dataset.
  • 42. The best model is the XGBoost Classifier, with the highest accuracy score of 0.9689.
  • 43. • Finding the names of the models with the maximum and minimum accuracy scores
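
The comparison on slides 41 to 43 can be sketched by collecting accuracy scores in a dictionary and taking its maximum and minimum. The data is synthetic, the hyperparameters are scikit-learn defaults, and XGBoost is left out so the sketch depends on scikit-learn alone, so the scores will not match the deck's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed churn data.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "SVC": SVC(),
}

# Fit each model and record its accuracy on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

best = max(scores, key=scores.get)
worst = min(scores, key=scores.get)
print(f"best: {best} ({scores[best]:.4f}), worst: {worst} ({scores[worst]:.4f})")
```
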
  • 44. OVERALL CONCLUSION 1. As MonthlyCharges increases, TotalCharges also increases. 2. Customers with a 'Month-to-month' contract have a higher churn rate. With no contract term, they are free to leave. 3. Churn is high when Monthly Charges are high and Total Charges are low. 4. Electronic check is the most common payment method and is associated with more churning customers. 5. Customers with the Fiber optic internet service type have churned more. DSL is the most popular internet service type. 6. PhoneService and PaperlessBilling are chosen by a significant number of customers, and few of them have churned. 7. The XGBoost model achieved the highest accuracy among the evaluated models.