SlideShare uma empresa Scribd logo
1 de 26
Group 4
Team members
Ravi
Richa
Sabarish
Vijay
Problem Statement
Flight Delay Prediction
Dataset understanding and description
No. of features =29, shape of data set = 484551 rows x 28 columns,
Target Variable = ArrDelay
Dataset understanding and description
 Missing Values
 Org_Airport -1177
 Dest_Airport -1479
 Duplicate Data
 2 rows are duplicate
 Duplicate columns
 Columns with same information i.e Org_Airport and Dest_Airport these are repeated
information in data set Origin represented by three letter code for Org_Airport and Dest
represents three letter code for Dest_Airport
 Categorical Variables
 1.UniqueCarrier
 2.FlightNum
 3.TailNum
 4.Origin
 5.Dest
Outliers
# -Columns having outliers :-
 arrdelay
 deepdelay
 taxiout
 carrier delay
 security delay
 late aircraft delay
Data Visualization Techniques used
 Box Plots
 Heat Maps
 Histogram
 Line graphs
 Pie chart
 PairPlot
 Used sweetviz library for more visualization
Screenshots of different visualizations
When we took threshold as .9 then following
features are correlated{'AirTime',
'CRSElapsedTime', 'DepDelay', 'Distance'}
EDA
flight_eda_report.html
Data Processing and FeatureEngg.
 Imputation – Check for Null values in data set and handle it by diff techniques
 Categorical Encoding – target encoding is used as high cardinality
 Handling Outliers – Box plots to analyse outliers in data
 Scaling - Min Max scaler is used
 Feature Selection- Correlation helped in getting feature correlation with each
other and target
 Feature Split – Derived features from Date and time features
Data set post FE – 24 features for Modelling
Our Research
 Analysed the data deeply from domain perspective which gave very interesting insights.
 Different types of categorical handling we have researched on and then came up with target encoding
 As we know deletion of features are always most impactful decision we used both visualization and
domain knowledge to do this part
 Missing Value handling we tried different approached and then finalized one
 We have derived attributes from given features which we really felt will be helpful for further analysing
Future Tasks
 More Feature Engineering
 Training the model on the selected features
 Model development
 Model assessment
Take away from last Meet
 Group Dynamics
 Elements of Data
 Dynamic data
Elements of Data
Feature Engg.
Logistic Regression – SFS for Feature Engineering
Logistic Regression – SFS for Feature
Engineering
Splitting of data
The test_size=0.2 It is split of test and training data as 80/20percent .
X_train data shape after splitting (387639, 24)
X_test data shape after splitting (96910, 24)
y_train data shape after splitting (387639,)
y_test data shape after splitting (96910,)
Linear Regression Model
Interpretation - The R² represents how much variance of the data is explained by
the model, the R2=0.90 means that 0.10 of the variance can not explain by the
model, the logical case when R2=1 the model completely fit and explained all
variance.
 Y = a + bX
 b = slope
 a = intercept
 X= coefficients or features
Ridge Regression Model
Ridge regression is a model tuning method that is used to analyse any data
that suffers from multicollinearity. This method performs L2 regularization.
When the issue of multicollinearity occurs, least-squares are unbiased, and
variances are large, this results in predicted values to be far away from the
actual values.
mean_squared_error with Ridge Regression with train data
0.0033528868720357806
R2 square with Ridge Regression with train data 0.9999999965474273
mean_squared_error with Ridge Regression with test data
0.0027463365607708775
R2 square with Ridge Regression with test data 0.9999999976478994
SVC
Random Forest Regressor
Neural Network
Future Recommendation
1. Regression vs Classification Problem
2. Dataset can have more records for delay = 0
3. Dataset can have more relevant features according to the domain
knowledge/experience
Comparison of Models
Interpretation
 Simple linear regression led to overfitting giving an unrealistic accuracy of 100%. This problem caused by
overfitting is well addressed by applying regularization on the regression model.We have used L2 Regularization
that is Ridge Regression to overcome this issue.
 SVM model is extremely unsuitable for this problem as it takes an unreasonable amount of time(near about 3
hours) to run the model and also gives subpar accuracy. It is computationally expensive and inappropriate for
problems with large datasets such as the one given.
 Random forest is also giving us good accuracy 98 %
 ANN is giving 98 %
Dynamic data
Dynamic data out using linear regression and Ridge Regression

Mais conteúdo relacionado

Semelhante a casestudy_important.pptx

LIDAR- Light Detection and Ranging.
LIDAR- Light Detection and Ranging.LIDAR- Light Detection and Ranging.
LIDAR- Light Detection and Ranging.Gaurav Agarwal
 
Human Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerDataHuman Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerDataIRJET Journal
 
Svd filtered temporal usage clustering
Svd filtered temporal usage clusteringSvd filtered temporal usage clustering
Svd filtered temporal usage clusteringLiang Xie, PhD
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingIRJET Journal
 
Application of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisApplication of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisDr.Pooja Jain
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET Journal
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET Journal
 
EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171Yaxin Liu
 
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Chakkrit (Kla) Tantithamthavorn
 
Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation SaravanakumarSekar4
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetupamarsri
 
Data science with R - Clustering and Classification
Data science with R - Clustering and ClassificationData science with R - Clustering and Classification
Data science with R - Clustering and ClassificationBrigitte Mueller
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...IRJET Journal
 
Backtracking based integer factorisation, primality testing and square root c...
Backtracking based integer factorisation, primality testing and square root c...Backtracking based integer factorisation, primality testing and square root c...
Backtracking based integer factorisation, primality testing and square root c...csandit
 
IRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Stock Market Prediction using Machine Learning AlgorithmIRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Stock Market Prediction using Machine Learning AlgorithmIRJET Journal
 
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...csandit
 
My Postdoctoral Research
My Postdoctoral ResearchMy Postdoctoral Research
My Postdoctoral ResearchPo-Ting Wu
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret packageVivian S. Zhang
 
Predicting Employee Attrition
Predicting Employee AttritionPredicting Employee Attrition
Predicting Employee AttritionShruti Mohan
 

Semelhante a casestudy_important.pptx (20)

LIDAR- Light Detection and Ranging.
LIDAR- Light Detection and Ranging.LIDAR- Light Detection and Ranging.
LIDAR- Light Detection and Ranging.
 
Human Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerDataHuman Activity Recognition Using AccelerometerData
Human Activity Recognition Using AccelerometerData
 
Svd filtered temporal usage clustering
Svd filtered temporal usage clusteringSvd filtered temporal usage clustering
Svd filtered temporal usage clustering
 
Parallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive IndexingParallel KNN for Big Data using Adaptive Indexing
Parallel KNN for Big Data using Adaptive Indexing
 
Application of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosisApplication of combined support vector machines in process fault diagnosis
Application of combined support vector machines in process fault diagnosis
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
IRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification AlgorithmsIRJET- Performance Evaluation of Various Classification Algorithms
IRJET- Performance Evaluation of Various Classification Algorithms
 
EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171EE660_Report_YaxinLiu_8448347171
EE660_Report_YaxinLiu_8448347171
 
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
Software Analytics In Action: A Hands-on Tutorial on Mining, Analyzing, Model...
 
Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation Data Assessment and Analysis for Model Evaluation
Data Assessment and Analysis for Model Evaluation
 
Apache Lens at Hadoop meetup
Apache Lens at Hadoop meetupApache Lens at Hadoop meetup
Apache Lens at Hadoop meetup
 
Data science with R - Clustering and Classification
Data science with R - Clustering and ClassificationData science with R - Clustering and Classification
Data science with R - Clustering and Classification
 
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
Empirical Analysis of Radix Sort using Curve Fitting Technique in Personal Co...
 
Backtracking based integer factorisation, primality testing and square root c...
Backtracking based integer factorisation, primality testing and square root c...Backtracking based integer factorisation, primality testing and square root c...
Backtracking based integer factorisation, primality testing and square root c...
 
IRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Stock Market Prediction using Machine Learning AlgorithmIRJET - Stock Market Prediction using Machine Learning Algorithm
IRJET - Stock Market Prediction using Machine Learning Algorithm
 
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
A FLOATING POINT DIVISION UNIT BASED ON TAYLOR-SERIES EXPANSION ALGORITHM AND...
 
My Postdoctoral Research
My Postdoctoral ResearchMy Postdoctoral Research
My Postdoctoral Research
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
Oct.22nd.Presentation.Final
Oct.22nd.Presentation.FinalOct.22nd.Presentation.Final
Oct.22nd.Presentation.Final
 
Predicting Employee Attrition
Predicting Employee AttritionPredicting Employee Attrition
Predicting Employee Attrition
 

Último

How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIShubhangi Sonawane
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfAyushMahapatra5
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin ClassesCeline George
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.christianmathematics
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingTechSoup
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Celine George
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptxMaritesTamaniVerdade
 

Último (20)

How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-IIFood Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
Food Chain and Food Web (Ecosystem) EVS, B. Pharmacy 1st Year, Sem-II
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17  How to Extend Models Using Mixin ClassesMixin Classes in Odoo 17  How to Extend Models Using Mixin Classes
Mixin Classes in Odoo 17 How to Extend Models Using Mixin Classes
 
This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.This PowerPoint helps students to consider the concept of infinity.
This PowerPoint helps students to consider the concept of infinity.
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptx
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Grant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy ConsultingGrant Readiness 101 TechSoup and Remy Consulting
Grant Readiness 101 TechSoup and Remy Consulting
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024Mehran University Newsletter Vol-X, Issue-I, 2024
Mehran University Newsletter Vol-X, Issue-I, 2024
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 

casestudy_important.pptx

  • 3. Dataset understanding and description No. of features =29, shape of data set = 484551 rows x 28 columns, Target Variable = ArrDelay
  • 4. Dataset understanding and description  Missing Values  Org_Airport -1177  Dest_Airport -1479  Duplicate Data  2 rows are duplicate  Duplicate columns  Columns with same information i.e Org_Airport and Dest_Airport these are repeated information in data set Origin represented by three letter code for Org_Airport and Dest represents three letter code for Dest_Airport  Categorical Variables  1.UniqueCarrier  2.FlightNum  3.TailNum  4.Origin  5.Dest
  • 5. Outliers # -Columns having outliers :-  arrdelay  deepdelay  taxiout  carrier delay  security delay  late aircraft delay
  • 6. Data Visualization Techniques used  Box Plots  Heat Maps  Histogram  Line graphs  Pie chart  PairPlot  Used sweetviz library for more visualization
  • 7. Screenshots of different visualizations When we took threshold as .9 then following features are correlated{'AirTime', 'CRSElapsedTime', 'DepDelay', 'Distance'}
  • 9. Data Processing and FeatureEngg.  Imputation – Check for Null values in data set and handle it by diff techniques  Categorical Encoding – target encoding is used as high cardinality  Handling Outliers – Box plots to analyse outliers in data  Scaling - Min Max scaler is used  Feature Selection- Correlation helped in getting feature correlation with each other and target  Feature Split – Derived features from Date and time features Data set post FE – 24 features for Modelling
  • 10. Our Research  Analysed the data deeply from domain perspective which gave very interesting insights.  Different types of categorical handling we have researched on and then came up with target encoding  As we know deletion of features are always most impactful decision we used both visualization and domain knowledge to do this part  Missing Value handling we tried different approached and then finalized one  We have derived attributes from given features which we really felt will be helpful for further analysing
  • 11. Future Tasks  More Feature Engineering  Training the model on the selected features  Model development  Model assessment Take away from last Meet  Group Dynamics  Elements of Data  Dynamic data
  • 13.
  • 14. Feature Engg. Logistic Regression – SFS for Feature Engineering
  • 15. Logistic Regression – SFS for Feature Engineering
  • 16. Splitting of data The test_size=0.2 It is split of test and training data as 80/20percent . X_train data shape after splitting (387639, 24) X_test data shape after splitting (96910, 24) y_train data shape after splitting (387639,) y_test data shape after splitting (96910,)
  • 17. Linear Regression Model Interpretation - The R² represents how much variance of the data is explained by the model, the R2=0.90 means that 0.10 of the variance can not explain by the model, the logical case when R2=1 the model completely fit and explained all variance.
  • 18.  Y = a + bX  b = slope  a = intercept  X= coefficients or features
  • 19. Ridge Regression Model Ridge regression is a model tuning method that is used to analyse any data that suffers from multicollinearity. This method performs L2 regularization. When the issue of multicollinearity occurs, least-squares are unbiased, and variances are large, this results in predicted values to be far away from the actual values. mean_squared_error with Ridge Regression with train data 0.0033528868720357806 R2 square with Ridge Regression with train data 0.9999999965474273 mean_squared_error with Ridge Regression with test data 0.0027463365607708775 R2 square with Ridge Regression with test data 0.9999999976478994
  • 20. SVC
  • 23. Future Recommendation 1. Regression vs Classification Problem 2. Dataset can have more records for delay = 0 3. Dataset can have more relevant features according to the domain knowledge/experience
  • 25. Interpretation  Simple linear regression led to overfitting giving an unrealistic accuracy of 100%. This problem caused by overfitting is well addressed by applying regularization on the regression model.We have used L2 Regularization that is Ridge Regression to overcome this issue.  SVM model is extremely unsuitable for this problem as it takes an unreasonable amount of time(near about 3 hours) to run the model and also gives subpar accuracy. It is computationally expensive and inappropriate for problems with large datasets such as the one given.  Random forest is also giving us good accuracy 98 %  ANN is giving 98 %
  • 26. Dynamic data Dynamic data out using linear regression and Ridge Regression