Airfare prediction using
Machine Learning with Apache Spark
on 1 billion observed airfares daily
AGIFORS RM 2016
Josef Habdank
20th of May, 2016
Lead Data Scientist & Data Platform Architect
jha@infare.com www.infare.com
In business since 2000
150 airline customers
11 airports and several OTA customers
7 offices worldwide
5000-6000 revenue managers log in to our platform every week
Leading provider of Airfare
Intelligence Solutions to
the Aviation Industry
Delivers actionable information based on a huge amount of
freshly collected and historical data
https://www.youtube.com/watch?v=h9cQTooY92E
Airfare Collection and Analytics
Online Airfare Data Collection
Data Processing and Modelling
Pharos: live analytics
Altus: historical analytics
Data Feeds
Collecting 1 billion airfares a day
Reached 1bn/day airfares on 7th of April 2016
Conservative projected growth based on leads
[Chart: airfare observations daily, linear axis from 0 to 3.5 billion; series: Observations Daily and Extrapolated Observations Daily]
Data collection doubling time ~7-12 months
Reached 1bn/day airfares on 7th of April 2016
Conservative projected growth based on leads
[Chart: airfare observations daily, logarithmic axis from 100,000 to 10 billion; series: Observations Daily and Extrapolated Observations Daily]
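The doubling-time figure can be turned into a rough projection. The sketch below assumes simple exponential growth from the 1bn/day level reached on 7 April 2016; the one-year horizon and the function names are illustrative and do not come from the slides.

```python
import math
from datetime import date

AVG_DAYS_PER_MONTH = 30.4375  # average month length, used to convert days to months

def projected_daily_observations(base, base_date, target_date, doubling_months):
    """Exponential growth: base * 2 ** (elapsed_months / doubling_months)."""
    elapsed_months = (target_date - base_date).days / AVG_DAYS_PER_MONTH
    return base * 2 ** (elapsed_months / doubling_months)

def months_to_reach(base, target, doubling_months):
    """Months needed to grow from base to target at the given doubling time."""
    return math.log2(target / base) * doubling_months

base = 1_000_000_000        # 1bn/day, reached on 7 April 2016 (from the slide)
start = date(2016, 4, 7)
horizon = date(2017, 4, 7)  # hypothetical horizon: one year later

fast = projected_daily_observations(base, start, horizon, doubling_months=7)
slow = projected_daily_observations(base, start, horizon, doubling_months=12)
print(f"After one year: {slow:,.0f} to {fast:,.0f} observations/day")

# Time to grow from 1bn/day to 5bn/day at each end of the doubling range
print(f"5bn/day after {months_to_reach(base, 5e9, 7):.1f} "
      f"to {months_to_reach(base, 5e9, 12):.1f} months")
```

Under these assumptions the faster doubling time roughly triples the daily volume within a year, while the slower one doubles it.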
Infare technology stack
2015
2016+
Infare technology stack
2016+
Data processing: Apache Spark
Message streaming: Kafka/Kinesis
BigData storage: Hadoop/S3
Microservices: C#.Net/Akka Spray
Real time analytics: MsSql/Cassandra
Machine Learning: PySpark + Scikit Learn
Tested on 6-8bn airfares a day
Soon reaching full market coverage:
how to utilize it?
Infare DataCenter
Altus: historical
Data Feeds
Granular Data Access API (live + historical queries to DB)
Prediction and Analytics API (all models presented later)
Pharos: live data + prediction
Prediction has been researched since 2012; however, accuracy requires larger market coverage.
An estimated 5bn airfares/day is the coverage required to launch the final product.
Prediction: minimum future price
+ API access
Prediction: price evolution
+ API access
Developing Prediction at Scale
• Tens to hundreds of millions of unique trips observed daily
• Tens to hundreds of observed prices per trip
• Clustering price vectors
• Training a model per cluster
• 10,000-50,000 models
• Training should take 2-3h to enable daily or real-time updates
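The training window implies a time budget per model. The arithmetic below is a back-of-the-envelope sketch; the worker count of 64 is a hypothetical value chosen for illustration and does not come from the slides.

```python
def seconds_per_model(n_models, window_hours, parallel_workers):
    """Average wall-clock seconds available for one model fit inside the window."""
    return window_hours * 3600 * parallel_workers / n_models

WORKERS = 64  # hypothetical number of parallel workers

for n_models in (10_000, 50_000):
    for hours in (2, 3):
        budget = seconds_per_model(n_models, hours, WORKERS)
        print(f"{n_models:>6} models, {hours}h window: {budget:.1f} s per model")
```

With a single worker the budget shrinks to well under one second per model at the upper end of the range, which is one way to read the requirement for parallel processing.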
Prediction of highly multivariate time series
The drawing depicts a trivial case with 2 dimensions and 3 models.
In reality there are tens of thousands of clusters in a space of more than 300 dimensions.
Each point represents an n-dimensional time-series vector.
Cluster the time series (after dimensionality reduction, which reduces sparsity).
Train ML models on the data within each respective cluster.
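The cluster-then-train idea can be sketched on the trivial 2-dimensional, 3-model case from the drawing. This is a minimal NumPy illustration on synthetic data, not the production pipeline: the hand-written k-means with farthest-point seeding stands in for a library clusterer, and the per-cluster model is an ordinary least-squares fit. All names and data are assumptions.

```python
import numpy as np

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def farthest_point_init(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from all centroids chosen so far."""
    chosen = [0]
    for _ in range(k - 1):
        dists = np.linalg.norm(points[:, None, :] - points[chosen][None, :, :], axis=2)
        chosen.append(int(dists.min(axis=1).argmax()))
    return points[chosen].copy()

def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(points, centroids)

def fit_per_cluster(features, targets, labels, k):
    """One least-squares linear model (weights + intercept) per cluster."""
    models = {}
    for j in range(k):
        mask = labels == j
        if mask.sum() <= features.shape[1]:
            continue  # not enough rows to determine weights and intercept
        design = np.column_stack([features[mask], np.ones(mask.sum())])
        coef, *_ = np.linalg.lstsq(design, targets[mask], rcond=None)
        models[j] = coef
    return models

# Synthetic data: three well-separated groups, each with its own linear relation
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
true_coefs = np.array([[1.0, 2.0, 3.0], [-1.0, 0.5, 0.0], [0.0, -2.0, 5.0]])
features = np.vstack([c + rng.uniform(-1.0, 1.0, size=(100, 2)) for c in centres])
targets = np.concatenate([
    features[i * 100:(i + 1) * 100] @ w[:2] + w[2] for i, w in enumerate(true_coefs)
])

centroids, labels = kmeans(features, k=3)
models = fit_per_cluster(features, targets, labels, k=3)
for j, coef in sorted(models.items()):
    print(f"cluster {j}: weights={coef[:2].round(3)}, intercept={coef[2]:.3f}")
```

Because each cluster gets its own model, the fits are independent of one another and can be trained in parallel.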
Remarks regarding modelling
• Requires careful feature selection
• Dimensionality reduction of the time series space is done using polynomial fitting or inverse exponential series fitting
• This transforms the price vectors into a parameter space: $f: P \mapsto \Theta$
• Clustering of the time series projections $\Theta$ using k-means or a Gaussian Mixture Model
• ARIMA formulated as Linear Regression trained on P space: $\mathrm{ARIMA}(0, 1, n) \equiv \mathbf{y} = \mathbf{X}\beta + \alpha$, where $\dim \mathbf{X} = (\cdot, n)$
• For some clusters, Support Vector Regression with a Radial Basis Function kernel
• Quantize the continuous co-domain to finite states drawn from data
• Requires in-memory parallel processing, using Scikit Learn on PySpark
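Three of the remarks above can be illustrated with a short NumPy sketch on synthetic data. The slides do not specify how the design matrix X is built, so the regression below is an assumption: it regresses each first difference on the previous n differences, which is strictly an autoregression on the differenced series. The function names, polynomial degree and example values are all hypothetical.

```python
import numpy as np

def project_to_parameters(prices, degree=3):
    """f: P -> Theta. Fit a polynomial to one price vector and return its
    coefficients (highest power first) as the low-dimensional representation."""
    t = np.linspace(0.0, 1.0, len(prices))  # normalised time axis
    return np.polyfit(t, prices, deg=degree)

def lagged_difference_regression(prices, n_lags):
    """Least-squares fit y = X beta + alpha, where y holds first differences
    and each row of X holds the n_lags differences that precede it."""
    diffs = np.diff(prices)  # the d = 1 differencing step
    rows = [diffs[i:i + n_lags] for i in range(len(diffs) - n_lags)]
    design = np.column_stack([np.array(rows), np.ones(len(rows))])
    coef, *_ = np.linalg.lstsq(design, diffs[n_lags:], rcond=None)
    return coef[:-1], coef[-1]  # beta, alpha

def quantize(values, states):
    """Snap each continuous value to the nearest of a finite set of states."""
    states = np.asarray(states)
    idx = np.abs(np.asarray(values)[:, None] - states[None, :]).argmin(axis=1)
    return states[idx]

# 1) Polynomial projection of a synthetic, exactly quadratic price curve
t = np.linspace(0.0, 1.0, 20)
curve = 50.0 * t**2 + 100.0
theta = project_to_parameters(curve, degree=2)

# 2) Regression on a synthetic series whose differences follow d_t = 0.5 * d_(t-1) + 1
diffs = [4.0]
for _ in range(11):
    diffs.append(0.5 * diffs[-1] + 1.0)
series = np.concatenate([[100.0], 100.0 + np.cumsum(diffs)])
beta, alpha = lagged_difference_regression(series, n_lags=1)

# 3) Quantization to states taken from (made-up) observed prices
states = np.unique([99.0, 149.0, 149.0, 199.0])
snapped = quantize([101.0, 160.0, 180.0], states)

print("theta:", theta.round(6))
print("beta:", beta.round(6), "alpha:", round(float(alpha), 6))
print("quantized:", snapped)
```

The coefficient vector `theta` is what a clustering step would operate on, while the regression is fitted on the original price space, in line with the remarks above.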
Future research:
estimating competitors’ demand curves
Could be solved as a Blind Source Separation or Machine Learning problem
Looking for a partner airline to pilot this research project
Airline’s own historical and current demand curves
Estimate of competitor’s current and future demand curves
Infare’s historical and current market prices
Question to audience
What do you think is the most important product?
1) Granular live and historical data access API
2) Price Prediction in Pharos + API
3) Estimating competitors’ booking curves
THANK YOU!
Please contact us if you would
like to collaborate on research
Josef Habdank
20th of May, 2016
Lead Data Scientist & Data Platform Architect
jha@infare.com www.infare.com

Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily, Josef Habdank, Infare Solutions

  • 1. Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily AGIFORS RM 2016 Josef Habdank 20th of May, 2016 Lead Data Scientist & Data Platform Architect jha@infare.com www.infare.com
  • 2. In business since 2000. 150 airline customers, 11 airports and several OTA customers. 7 offices worldwide. 5000-6000 revenue managers log in to our platform every week. Leading provider of Airfare Intelligence Solutions to the aviation industry. Delivers actionable information based on a huge amount of freshly collected and historical data. https://www.youtube.com/watch?v=h9cQTooY92E
  • 3. Airfare Collection and Analytics. Online Airfare Data Collection; Data Processing and Modelling; Pharos: live analytics; Altus: historical analytics; Data Feeds
  • 4. Collecting 1 billion airfares a day. Reached 1bn airfares/day on 7th of April 2016. [Chart: airfare observations daily, linear axis from 0 to 3.5bn; series: Observations Daily, Extrapolated Observations Daily (conservative projected growth based on leads)]
  • 5. Data collection doubling time ~7-12 months. Reached 1bn airfares/day on 7th of April 2016. [Chart: airfare observations daily, logarithmic axis from 100,000 to 10bn; series: Observations Daily, Extrapolated Observations Daily (conservative projected growth based on leads)]
  • 7. Infare technology stack 2016+. Data processing: Apache Spark. Message streaming: Kafka/Kinesis. BigData storage: Hadoop/S3. Microservices: C#.Net/Akka Spray. Real time analytics: MsSql/Cassandra. Machine Learning: PySpark + Scikit Learn. Tested on 6-8bn airfares a day.
  • 8. Soon reaching full market coverage: how to utilize it? Infare DataCenter; Altus: historical; Data Feeds; Granular Data Access API (live + historical queries to DB); Prediction and Analytics API (all models presented later); Pharos: live data + prediction. Researched prediction since 2012, however accuracy requires larger market coverage. Estimated that 5bn airfares/day is the coverage required to launch the final product.
  • 9. Prediction: minimum future price + API access
  • 11. Developing Prediction at Scale: tens to hundreds of millions of unique trips observed daily; tens to hundreds of observed prices per trip; clustering price vectors; training a model per cluster; 10000-50000 models; training should take 2-3h to enable daily or real-time updates
  • 12. Prediction of highly multivariate time series. The drawing depicts the trivial case of 2 dimensions and 3 models. In reality there are tens of thousands of clusters in a space of more than 300 dimensions. Each point represents an n-dimensional vector time series. Cluster the time series (after dimensionality reduction, which reduces sparsity). Train ML models on the data within each respective cluster.
  • 13. Remarks regarding modelling: requires careful feature selection; dimensionality reduction of the time series space is done using polynomial fitting or inverse exponential series fitting; this transforms the price vectors into a parameter space, $f: P \mapsto \Theta$; clustering of the time series projection $\Theta$ uses k-means or a Gaussian Mixture Model; ARIMA is formulated as Linear Regression trained on the $P$ space: $\mathrm{ARIMA}(0, 1, n) \equiv \boldsymbol{y} = \boldsymbol{X}\beta + \alpha$, where $\dim \boldsymbol{X} = (\cdot, n)$; for some clusters, Support Vector Regression with a Radial Basis Function kernel; quantize the continuous co-domain to finite states drawn from data; requires in-memory parallel processing, using Scikit Learn on PySpark
  • 14. Future research: estimating competitors’ demand curves. Could be solved as a Blind Source Separation or Machine Learning problem. Airline’s own historical and current demand curves; Infare’s historical and current market prices; estimate of competitors’ current and future demand curves. Looking for a partner airline to pilot this research project.
  • 15. Question to audience: what do you think is the most important product? 1) Granular live and historical data access API 2) Price Prediction in Pharos + API 3) Estimating competitors’ booking curves
  • 16. THANK YOU! Please contact us if you would like to collaborate in research. Josef Habdank, 20th of May, 2016, Lead Data Scientist & Data Platform Architect, jha@infare.com, www.infare.com
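
The modelling pipeline outlined in slides 11-13 (dimensionality reduction by polynomial fitting, clustering in the parameter space, one model per cluster) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Infare's implementation: it uses plain NumPy rather than Scikit Learn on PySpark, a small hand-written k-means, synthetic price curves, and an ordinary least-squares model per cluster that predicts the minimum future price (slide 9) from the first `n_known` observed prices. All function names, the polynomial degree, and the data are hypothetical.

```python
import numpy as np

def fit_curve_params(prices, degree=2):
    """Map each price vector (one row) to polynomial coefficients, f: P -> Theta."""
    days = np.linspace(0.0, 1.0, prices.shape[1])
    # np.polyfit fits every column of a 2-D y at once; transpose back to rows.
    return np.polyfit(days, prices.T, degree).T

def nearest_center(points, centers):
    """Index of the closest center for every point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=50, seed=0):
    """Tiny Lloyd's k-means; stands in for a library implementation."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(points, centers)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return nearest_center(points, centers), centers

def train_per_cluster(prices, labels, n_known):
    """Fit y = X beta + alpha per cluster; y is the minimum future price."""
    models = {}
    for c in np.unique(labels):
        rows = prices[labels == c]
        X = np.column_stack([rows[:, :n_known], np.ones(len(rows))])
        y = rows[:, n_known:].min(axis=1)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        models[int(c)] = beta  # last entry of beta is the intercept alpha
    return models

def predict_min_future_price(known_prices, centers, models, degree=2):
    """Assign a partially observed price vector to a cluster, then predict."""
    theta = fit_curve_params(known_prices[None, :], degree)[0]
    cluster = int(np.linalg.norm(centers - theta, axis=1).argmin())
    # Assumes this cluster had training rows, so a model exists for it.
    return float(np.append(known_prices, 1.0) @ models[cluster])

# Synthetic data (assumption): 50 rising and 50 falling fare curves, 20 points each.
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 20)
rising = 100 + 50 * x + rng.normal(0, 1, size=(50, 20))
falling = 200 - 30 * x + rng.normal(0, 1, size=(50, 20))
prices = np.vstack([rising, falling])

n_known = 14  # prices observed so far; the remaining 6 are "future"
theta = fit_curve_params(prices[:, :n_known])
labels, centers = kmeans(theta, k=2)
models = train_per_cluster(prices, labels, n_known)
pred = predict_min_future_price(prices[0, :n_known], centers, models)
print(f"predicted min future price: {pred:.1f}, "
      f"actual: {prices[0, n_known:].min():.1f}")
```

In this sketch each cluster's model is independent of the others, so the per-cluster training loop is the natural unit to distribute, which is presumably why the slides call for in-memory parallel processing across tens of thousands of models.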