Prediction with Machine Learning (ML) techniques on Big Data is challenging both computationally and at the system level. Scalability is the prime concern, especially when the system processes approximately 10^9 observations per day. To train models rapidly across the whole multivariate space, time series vectors that exhibit significant similarities are clustered into groups. The resulting vector clusters can then be modelled using ML tools capable of coefficient estimation at massive scale (Apache Spark with Scikit Learn). The presentation describes the application of Linear Regression and Support Vector Regression with a Radial Basis Function kernel. This approach trains the models fast enough to complete the task within a couple of hours, allowing daily or even real-time updates of the coefficients. The framework is used to predict airfares as a support tool for Revenue Management systems.
Airfare prediction using Machine Learning with Apache Spark on 1 billion observed airfares daily, Josef Habdank, Infare Solutions
1. Airfare prediction using
Machine Learning with Apache Spark
on 1 billion observed airfares daily
AGIFORS RM 2016
Josef Habdank
20th of May, 2016
Lead Data Scientist & Data Platform Architect
jha@infare.com www.infare.com
2. In business since 2000
150 airline customers
11 airports and
several OTA customers
7 offices worldwide
5
5000-6000 revenue managers
log in to our platform every week
Leading provider of Airfare
Intelligence Solutions to
the Aviation Industry
Delivers actionable information
based on huge amount of freshly
collected and historical data
https://www.youtube.com/watch?v=h9cQTooY92E
3. Airfare Collection and Analytics
Online Airfare Data Collection
Data Processing and Modelling
Pharos: live analytics
Altus: historical analytics
Data Feeds
4. Collecting 1 billion airfares a day
Reached 1bn/day airfares on 7th of April 2016
Conservative projected growth based on leads
[Chart: airfare observations daily, linear axis from 0 to 3.5bn; series: Observations Daily, Extrapolated Observations Daily]
5. Data collection doubling time ~7-12 months
Reached 1bn/day airfares on 7th of April 2016
Conservative projected growth based on leads
[Chart: airfare observations daily, logarithmic axis from 100,000 to 10bn; series: Observations Daily, Extrapolated Observations Daily]
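A rough reading of the doubling time above, as a minimal sketch: under steady exponential growth, the time needed to grow from one daily volume to another is the doubling time multiplied by log2 of the ratio. The function name months_to_reach is illustrative, the 5bn/day target is the launch coverage quoted on slide 8, and constant doubling is an assumption rather than a forecast.

```python
import math

def months_to_reach(current, target, doubling_months):
    """Months for `current` to grow to `target`, assuming steady doubling."""
    return doubling_months * math.log2(target / current)

current = 1e9  # 1bn airfares/day, reached on 7th of April 2016 (from the slide)
target = 5e9   # 5bn airfares/day, the launch coverage quoted on slide 8

# Bounds of the ~7-12 month doubling time quoted on the slide.
fast = months_to_reach(current, target, 7)
slow = months_to_reach(current, target, 12)
print(f"Between {fast:.1f} and {slow:.1f} months")
```

Under these assumptions the 5bn/day mark falls roughly 16 to 28 months after the 1bn/day milestone.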
7. Infare technology stack
2016+
Data processing: Apache Spark
Message streaming: Kafka/Kinesis
BigData storage: Hadoop/S3
Microservices: C#.Net/Akka Spray
Real time analytics:
MsSql/Cassandra
Machine Learning:
PySpark + Scikit Learn
Tested on 6-8bn airfares a day
8. Reaching full market coverage soon: how to utilize it?
Infare DataCenter
Altus: historical
Data Feeds
Granular Data Access API (live + historical queries to DB)
Prediction and Analytics API (all models presented later)
Pharos: live data + prediction
Prediction has been researched since 2012; however, accuracy requires larger market coverage.
An estimated 5bn airfares/day is the coverage required to launch the final product.
11. Developing Prediction at Scale
• Tens to hundreds of millions of unique trips observed daily
• Tens to hundreds of observed prices per trip
• Clustering price vectors
• Training a model per cluster
• 10,000-50,000 models
• Training should take 2-3h to enable daily or real-time updates
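The cluster-then-train idea in the list above can be sketched on a single machine with Scikit Learn. This is a minimal illustration under stated assumptions, not the production pipeline: the data is synthetic, the cluster count of 3 and the choice of the last price as the regression target are made up for the example, and the helper predict is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for price vectors: 600 trips with 8 observed prices each.
# Three artificial price levels make the clusters easy to find.
levels = rng.choice([100.0, 300.0, 600.0], size=600)
prices = levels[:, None] + rng.normal(0.0, 5.0, size=(600, 8))

# Inputs are the first 7 prices; the target is the last price of each vector.
X, y = prices[:, :-1], prices[:, -1]

# Step 1: cluster the price vectors.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Step 2: train one model per cluster, using only that cluster's rows.
models = {}
for label in np.unique(kmeans.labels_):
    mask = kmeans.labels_ == label
    models[int(label)] = LinearRegression().fit(X[mask], y[mask])

def predict(x_new):
    """Route each vector to its cluster's model and predict the next price."""
    x_new = np.atleast_2d(x_new)
    labels = kmeans.predict(x_new)
    return np.array([
        models[int(label)].predict(row[None, :])[0]
        for label, row in zip(labels, x_new)
    ])

preds = predict(X[:5])
```

Because each cluster's model depends only on that cluster's rows, the fits are independent of one another, which is what makes it plausible to distribute tens of thousands of them across a Spark cluster.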
12. Prediction of highly multivariate time series
The drawing depicts a trivial case with 2 dimensions and 3 models.
In reality there are tens of thousands of clusters in a > 300-dimensional space.
Each point represents an n-dimensional vector time series.
Cluster the time series (after dimensionality reduction to reduce sparsity).
Train ML models on the data within the respective cluster.
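One way to picture the reduction step is to fit a low-degree polynomial to each sparse price series and cluster the fitted coefficients instead of the raw prices. The sketch below assumes a quadratic fit and a two-component Gaussian Mixture Model; the synthetic curves, the 30% missing rate, and the helper to_params are illustrative assumptions only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
n_days = 30
t = np.linspace(0.0, 1.0, n_days)  # rescaled time axis keeps the fit stable

# Two synthetic families of price curves: flat and steeply rising.
flat = np.full((200, n_days), 200.0)
rising = 200.0 + 300.0 * t**2 + np.zeros((200, n_days))
prices = np.vstack([flat, rising]) + rng.normal(0.0, 5.0, size=(400, n_days))

# Simulate sparsity: about 30% of the prices were never observed.
prices[rng.random(prices.shape) < 0.3] = np.nan

def to_params(series, degree=2):
    """Map one price series (NaN = not observed) to polynomial coefficients."""
    seen = ~np.isnan(series)
    return np.polyfit(t[seen], series[seen], degree)

# Dense parameter space: 3 coefficients per trip instead of 30 sparse prices.
theta = np.array([to_params(row) for row in prices])

# Cluster in the parameter space.
gmm = GaussianMixture(n_components=2, random_state=0).fit(theta)
labels = gmm.predict(theta)
```

The fit uses only the observed points of each series, so every trip ends up with a complete, fixed-length coefficient vector regardless of how many prices were missing.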
13. Remarks regarding modelling
• Requires careful feature selection
• Dimensionality reduction of the time series space is done using polynomial fitting or inverse exponential series fitting
• This transforms the price vectors into a parameter space: 𝑓: 𝑃 ↦ Θ
• Clustering of the time series projection Θ uses k-means or a Gaussian Mixture Model
• ARIMA formulated as Linear Regression trained on the P space: ARIMA(0, 1, 𝑛) ≡ 𝒚 = 𝑿𝛽 + 𝛼, where dim 𝑿 = (∙, 𝑛)
• For some clusters, Support Vector Regression with a Radial Basis Function kernel
• Quantize the continuous co-domain to finite states drawn from the data
• Requires in-memory parallel processing, using Scikit Learn on PySpark
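The regression and quantization remarks above can be illustrated together for one cluster's price series. This sketch makes one possible reading of the slide explicit: the series is differenced once, and the next difference is regressed on the previous n differences, giving a design matrix X with n columns. That construction, the synthetic random-walk prices, the SVR hyperparameters, and the helpers lagged_design and quantize are all assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

def lagged_design(prices, n):
    """Difference once, then stack n consecutive differences as columns of X."""
    d = np.diff(prices)
    X = np.column_stack([d[i:len(d) - n + i] for i in range(n)])
    return X, d[n:]  # target: the difference that follows each row of X

def quantize(values, states):
    """Snap each value to the nearest state observed in the data."""
    values = np.atleast_1d(np.asarray(values, dtype=float))
    nearest = np.abs(values[:, None] - states[None, :]).argmin(axis=1)
    return states[nearest]

rng = np.random.default_rng(2)

# Synthetic price path moving in discrete fare steps.
steps = rng.choice([-10.0, 0.0, 10.0, 20.0], size=200)
prices = 200.0 + np.cumsum(steps)
states = np.unique(prices)  # finite set of price states drawn from the data

n = 5
X, y = lagged_design(prices, n)

models = {
    "linear": LinearRegression().fit(X, y),
    "svr_rbf": SVR(kernel="rbf", C=10.0, epsilon=1.0).fit(X, y),
}

# One-step-ahead forecast: last price plus the predicted difference,
# snapped to the nearest price that actually occurred.
last_diffs = np.diff(prices)[-n:][None, :]
forecasts = {}
for name, model in models.items():
    raw = prices[-1] + model.predict(last_diffs)[0]
    forecasts[name] = quantize(raw, states)[0]
```

Snapping to observed states keeps a forecast from landing on a price that no fare ever took, at the cost of never predicting a level that has not been seen before.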
14. Future research: estimating competitors’ demand curves
Could be solved as a Blind Source Separation or Machine Learning problem
Looking for a partner airline to pilot this research project
Airline’s own historical and current demand curves
Infare’s historical and current market prices
Estimate of competitor’s current and future demand curves
15. Question to audience
What do you think is the most important product?
1) Granular live and historical data access API
2) Price Prediction in Pharos + API
3) Estimating competitors’ booking curves
16. THANK YOU!
Please contact us if you would
like to collaborate on research
Josef Habdank
20th of May, 2016
Lead Data Scientist & Data Platform Architect
jha@infare.com www.infare.com