Airline flights delay prediction- 2014 Spring Data Mining Project

School Project for graduate school. Can be further improved with weather data.

Published in: Education

  1. Flight Delay Prediction Model
     Vishwanath K, Viral Tarpara, Haozhe Wang, Ling Zhou
  2. Business Problem Overview
     Flight delays are a challenging problem for all airline companies, leading to:
     ● Financial losses
     ● Negative impact on their business reputation
     Cost of Delays in the US: $32.9B, made up of:
     ● Cost to Airlines: $8.3B
     ● Cost to Passengers: $16.7B
     ● Cost from Lost Demand: $3.9B
     ● GDP Impact: $4B
     Source: Total Cost Impact Study
  3. Business Problem Overview
     [Diagram: a model that predicts flight delays helps airline companies optimize operations and reduce further losses.]
  4. Literature Review on Delay Costs
     ● The airline industry incurs an average cost of about $11,300 per delayed flight.
     ● This figure is based on an average of 61,000 delayed flights per month and excludes costs to passengers and lost demand.
     ● A more accurate delay prediction system can help identify the operational variables that contribute to delays.
     ● While some conditions, such as weather, are not controllable, the way airlines and airports operate and optimize resources in the face of "acts of God" is controllable.
  5. Data Understanding
     Dataset: On-Time Performance, from the Research and Innovative Technology Administration, BTS
  6. Data Understanding
     Potentially useful variables:
     • Quarter; Month; Day of Month
     • Flight Number
     • Origin Airport; Destination Airport
     • Departure Block; Arrival Block
     • Carrier
     • Departure Delay; Arrival Delay
     Variable groups shown on the slide: Time, Operation, Geography, Airline
  7. Data Preparation
     Training:
     • Selected attributes from 2012 data
     • Derived attributes from 2011 data
     • Attributes from additional dataset
     Testing:
     • Selected attributes from 2013 data
     • Derived attributes from 2012 data
     • Attributes from additional dataset
  8. Data Preparation
     Selected attributes:
     1. Quarter
     2. Month
     3. Day of Month
     4. FL_NUM: Flight Number
     5. Origin: Origin Airport
     6. Dest: Destination Airport
     7. UniqueCarrier: Unique Carrier Code
     8. DepTimeBLK: Departure Time Block, hourly intervals
     9. ArrTimeBLK: Arrival Time Block, hourly intervals
     Target: ArrDel: Arrival Delay, 1 = Y, 0 = N
     Note: some attributes were removed for the project. To build the full model, these attributes are necessary.
  9. Data Preparation
     Derived attributes:
     1. Airline_Delay: the percentage of delayed flights for each airline in one year
     2. Flight_Delay: the percentage of delays for each specific flight in one year
     3. Day_Delay: the percentage of delayed flights by day of month, across all flights in one year
     4. Origin_Delay: the percentage of delayed flights for each origin airport, across all flights in one year
     5. Dest_Delay: the percentage of delayed flights for each destination airport, across all flights in one year
     6. Dep_BLK_Delay: the percentage of delayed flights for each departure block, across all flights in one year
     7. Arr_BLK_Delay: the percentage of delayed flights for each arrival block, across all flights in one year
  10. Data Preparation
      Additional dataset: Schedule Employees, from the Research and Innovative Technology Administration, BTS
  11. Data Preparation
      Additional attributes:
      1. Full-time employees in current month
      2. Part-time employees in current month
      3. FTE employees: full-time equivalent employees in current month (2 part-time = 1 full-time)
      4. Total employees in current month
      We wanted to see whether historical on-time performance and current staffing levels were enough to build a decent model.
  12. Data Preparation
      • Large dataset (2.9GB)
      • Merge these attributes by month (via Excel VLOOKUP)
      • Use the data of one month, January, to build the model
  13. Modeling
      • Naive Bayes
      • Decision tree: J48 (with various leaf sizes)
      • Logistic Regression ("refused" to process in Weka)
  14. Modeling
      Preprocess:
      • Convert the attribute types
      • Convert the CSV file to ARFF (70MB)
      Training:
      • Instances: 422539
      • Attributes: 19
      Testing:
      • Instances: 478145
      • Attributes: 19
  15. Naive Bayes Modeling on Training Data
      Confusion matrix of Naive Bayes:
             a       b   <-- classified as
        333876   28289 | a = 0 (on-time)
         45761   14613 | b = 1 (delay)

      Model         Accuracy   ROC Area
      Naive Bayes   82.475%    0.694

      Slide annotation: high cost, lower is better.
  16. Modeling: snapshot
      J48 with different parameter values:
      MinObjNum   Accuracy    ROC Area
      15          88.4917%    0.85
      25          87.7308%    0.791
      50          87.3311%    0.774
      100         87.0414%    0.767
      150         86.8746%    0.761
  17. Modeling: snapshot
      Confusion matrix of J48, MinObjNum = 15:
             a       b   <-- classified as
        356407    5758 | a = 0
         42869   17505 | b = 1
      Confusion matrix of J48, MinObjNum = 25:
             a       b   <-- classified as
        356570    5595 | a = 0
         46247   14127 | b = 1
      Confusion matrix of J48, MinObjNum = 50:
             a       b   <-- classified as
        356885    5280 | a = 0
         48251   12123 | b = 1
      Confusion matrix of J48, MinObjNum = 100:
             a       b   <-- classified as
        357881    4284 | a = 0
         50471    9903 | b = 1
  18. Training Model Performance: J48
      [Chart: ROC Area (vertical axis, 0.70 to 0.86) against MinObjNum (15, 25, 50, 100, 150).]
      The trend levels off around 0.76.
  19. Model Evaluation
      • The evaluation is based mainly on the instances falsely classified as on-time: this is the case where passengers are given confidence that they will arrive on time but end up being late.
      • We choose the training model with the largest AUC and the smallest false negative (FN) value.

      Model (MinObjNum)   Accuracy    ROC Area   FN Value   Result
      J48 (15)            88.4917%    0.85       42869      Reject
      J48 (25)            87.7308%    0.791      46247      Reject
      J48 (50)            87.3311%    0.774      48251      Reject
      J48 (100)           87.0414%    0.767      50471      Reject
      J48 (150)           86.8746%    0.761      51276      Reject
      Naive Bayes         82.475%     0.694      45761      Accept
  20. Model Evaluation
      Model performance on testing data (Jan 2013):
      Model               ROC Area   FN Value
      J48_minObjNum=100   0.512      82442
      Naive Bayes         0.583      74058
  21. Deployment Example: Avoiding the Most Delay-Prone Parts of the System
      • Schedule your flight without a layover.
      • Avoid the major hubs by using smaller airports.
      • Chicago ORD, New York City (all airports), and Atlanta were the worst in terms of congestion.
      • Early-morning departures have better on-time performance.
      • Late-afternoon and early-evening departures have the worst on-time performance.
  22. "When I can, I try to arrive the night before," says Russell Hayward, a USA TODAY Road Warrior. "But that eats up a whole work day, wasted travel time due to airline uncertainty." (Woodyard, 2001)
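
The derived attributes in the Data Preparation slides are per-group delay percentages computed from the previous year's flights and then attached to the current year's records. The slides describe doing this merge in Excel; the following is only a minimal Python sketch of the same idea. It assumes each flight is a dict carrying the `UniqueCarrier` and `ArrDel` fields named in the slides. The helper names `delay_rate_by` and `attach_rate` and the sample rows are hypothetical.

```python
from collections import defaultdict


def delay_rate_by(flights, key):
    """Return {group value: percentage of flights in that group with ArrDel == 1}."""
    totals = defaultdict(int)
    delayed = defaultdict(int)
    for flight in flights:
        totals[flight[key]] += 1
        delayed[flight[key]] += flight["ArrDel"]
    return {group: 100 * delayed[group] / totals[group] for group in totals}


def attach_rate(flights, rates, key, new_field, default=None):
    """Copy each flight and add the prior-year rate for its group.

    Groups that did not appear in the prior year get `default`.
    """
    return [{**flight, new_field: rates.get(flight[key], default)} for flight in flights]


# Hypothetical toy rows, not taken from the BTS dataset.
prior_year = [
    {"UniqueCarrier": "AA", "ArrDel": 1},
    {"UniqueCarrier": "AA", "ArrDel": 0},
    {"UniqueCarrier": "DL", "ArrDel": 0},
    {"UniqueCarrier": "DL", "ArrDel": 0},
]
current_year = [
    {"UniqueCarrier": "AA", "ArrDel": 0},
    {"UniqueCarrier": "UA", "ArrDel": 1},
]

airline_delay = delay_rate_by(prior_year, "UniqueCarrier")
training_rows = attach_rate(current_year, airline_delay, "UniqueCarrier", "Airline_Delay")
```

Calling `delay_rate_by` with a different key (for example `Origin` or `DepTimeBLK`) would give the other derived attributes in the same way. Computing the rates from the prior year only, as the slides do, keeps the target values of the modeled year out of its own features.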
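
The accuracy and false negative (FN) figures in the modeling and evaluation slides follow directly from the confusion matrices. As a rough check, the sketch below recomputes them from the Naive Bayes and J48 (MinObjNum = 15) training matrices quoted in the slides. The function name `summarize` is hypothetical, and recall is not reported in the slides; it is included here only to show how few delayed flights each model actually catches.

```python
def summarize(tn, fp, fn, tp):
    """Compute basic metrics from a 2x2 confusion matrix.

    Class 0 = on-time (negative), class 1 = delay (positive), so a false
    negative is a delayed flight that was predicted to be on-time.
    """
    total = tn + fp + fn + tp
    return {
        "instances": total,
        "accuracy": (tn + tp) / total,
        "false_negatives": fn,
        "recall": tp / (tp + fn),  # share of delayed flights that were caught
    }


# Training confusion matrices as quoted in the slides.
naive_bayes = summarize(tn=333876, fp=28289, fn=45761, tp=14613)
j48_min15 = summarize(tn=356407, fp=5758, fn=42869, tp=17505)

for name, metrics in [("Naive Bayes", naive_bayes), ("J48, MinObjNum=15", j48_min15)]:
    print(
        f"{name}: accuracy={metrics['accuracy']:.4%}, "
        f"FN={metrics['false_negatives']}, recall={metrics['recall']:.3f}"
    )
```

Both matrices sum to the 422539 training instances, and the recomputed accuracies agree with the reported 82.475% and 88.4917%.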
