We predict future revenues in hotels by solving the data science puzzle end-to-end: from cloud infrastructure and security, to data ingestion, data cleaning, feature building, model training and model scoring.
The video of this talk is here: https://www.facebook.com/datamindedbe/posts/1385820021562117
4. “Data natives”
Understand entire pyramid
Specialise in a few layers
https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
5. Four iterations
Iteration 1: Simple ML algorithm in Python; 20 features; only happy path; manual runs, no testing; sample data.
Iteration 2: Cross validation, large-scale predictions in Spark; 40 features; cleaning of all sources; automated runs in the cloud with testing; internal, structured data sources.
Iteration 3: External review, improved accuracy; 80 features; basic business logic; monitoring, workflow optimisation; external, unstructured data sources.
Iteration 4: Optimised model, real-time predictions; 160 features; advanced data integration; fault-tolerant, auto-scaling; real-time data ingest and real-time results.
6. 1. Think about the entire pyramid
2. Deliver value in iterations
3. Do a vertical slice at each iteration
https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
7. Implementation for Juyo
Iteration 1: Sample data; Python on my laptop; simple booking logic; features hacked together; Random Forest on my laptop. Not there yet :-)
Iteration 2: Booking data and event data, delivered on an FTP server; Lighthouse managed service; complex booking logic, integrated event data; feature engine; Random Forest in a docker container. Not there yet :-)
Iteration 3: ?
8. Components of Lighthouse managed service
Data sources: databases, CSV files, IoT sensors, cell phones, factories, www, ...
Storage layer
AWS S3: data lake storage. Stores the RAW incoming data in any format. The processed data uses a structured, compressed format such as ORC.
AWS Glue Data Catalog: metastore that describes all data in the data lake. It is available in EMR, Batch and Athena.
Processing layer
Apache Nifi: ingestion of any kind of data.
Apache Airflow: orchestration and scheduling of jobs. Tracks success and duration of jobs. Recovers from failure.
AWS EMR: cluster processing. Run Spark jobs at scale, and run several clusters in parallel for mission-critical workloads.
AWS Batch: single-node processing. Run any workload in a docker container; many jobs can run in parallel.
Consumption layer
AWS Athena: SQL queries directly on the data in the data lake.
Apache Zeppelin: notebook server to do data science experiments on a cluster in the cloud.
Lighthouse UI: visualise datasets, applications and lineage on the web.
Data export: load data into downstream systems such as AWS RDS, Redshift, DynamoDB, ElasticSearch, a REST API, ...
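To make the consumption layer concrete, here is a minimal sketch of an Athena query request against the data lake. This is a hypothetical example: the `clean.bookings` table, the `hotel_id` column and the bucket name are invented, and the commented-out boto3 call shows how the request would be submitted in a real AWS environment.

```python
# Hypothetical Athena request against the data lake.
# Database, table and bucket names are illustrative; the query runs
# directly on the ORC files described by the Glue Data Catalog.
request = {
    "QueryString": (
        "SELECT hotel_id, count(*) AS bookings "
        "FROM clean.bookings GROUP BY hotel_id"
    ),
    "QueryExecutionContext": {"Database": "clean"},
    "ResultConfiguration": {"OutputLocation": "s3://my-lake/athena-results/"},
}

# In a real environment (AWS credentials required):
# import boto3
# athena = boto3.client("athena")
# execution_id = athena.start_query_execution(**request)["QueryExecutionId"]
print(sorted(request))  # ['QueryExecutionContext', 'QueryString', 'ResultConfiguration']
```

Because the catalog already describes the data, no schema needs to be declared in the query itself.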
9. Example workflow: clean the data and build features in Spark. Build and score the model in Python.
Cleansing (EMR cluster): raw data is converted to ORC, invalid rows are removed, and the cleaned result is stored in S3.
Master records (EMR cluster): an integrated view of the different data sources is built as a single version of the truth.
Feature building (EMR cluster): metrics are calculated and features are built using historical data.
Model training (AWS Batch): a predictive model is built using the features from the previous step as input.
Model scoring (AWS Batch): the trained model is applied to the features to produce predictions.
Data flow: S3 /RAW → S3 /CLEAN → S3 /MASTER → S3 /METRICS → S3 /MODELS → RDS PREDICTIONS
Code can be written in any language, as long as it runs in a docker container or on an EMR cluster.
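The stage ordering above can be sketched as a simple dependency chain. This is an illustrative sketch only: in production the dependency graph lives in an Apache Airflow DAG, and the bucket name and RDS target below are invented placeholders that mirror the slide's S3 prefixes.

```python
# Minimal sketch of the workflow as an ordered pipeline.
# Stage names and S3 prefixes mirror the slide; paths are illustrative.
PIPELINE = [
    ("cleansing",        "s3://lake/RAW",     "s3://lake/CLEAN"),
    ("master_records",   "s3://lake/CLEAN",   "s3://lake/MASTER"),
    ("feature_building", "s3://lake/MASTER",  "s3://lake/METRICS"),
    ("model_training",   "s3://lake/METRICS", "s3://lake/MODELS"),
    ("model_scoring",    "s3://lake/MODELS",  "rds://predictions"),
]

def run(pipeline):
    """Run each stage in order; each stage reads the previous stage's output."""
    log = []
    for stage, src, dst in pipeline:
        log.append(f"{stage}: {src} -> {dst}")  # real stages run on EMR or AWS Batch
    return log

for line in run(PIPELINE):
    print(line)
```

Because every stage only talks to S3 prefixes, each one can be swapped for a container in any language, as the slide notes.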
10. Predictive model - approach
Features are built to predict for the next day, for the next week, and up to the next year, feeding 27 Random Forest models.
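One way to organise a family of per-horizon models is a dictionary keyed by horizon. The sketch below is hypothetical: the horizon labels are invented, and a trivial mean-predictor stands in for the Random Forest so the example is self-contained.

```python
# Hypothetical sketch: one model per prediction horizon.
# The talk trains 27 Random Forest models; here a mean-predictor
# stands in so the example runs without external libraries.

class MeanModel:
    """Placeholder for a Random Forest: predicts the training-set mean."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)

# 27 horizons with illustrative labels.
horizons = [f"day+{d}" for d in range(1, 28)]

def train_models(features_by_horizon):
    """features_by_horizon: {horizon: (X, y)} -> {horizon: fitted model}"""
    return {h: MeanModel().fit(X, y) for h, (X, y) in features_by_horizon.items()}

# Toy training data per horizon.
data = {h: ([[0], [1]], [10.0, 20.0]) for h in horizons}
models = train_models(data)
print(len(models))                      # 27
print(models["day+1"].predict([[2]]))   # [15.0]
```

Keeping one independent model per horizon makes it easy to retrain or score a single horizon without touching the others.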
14. Both EMR and AWS Batch allow for very cost-effective runs using Spot instances, resulting in 80% cost
savings without compromising stability.
Spot instances == using free capacity of AWS
AWS rents out its unused capacity at market rates,
but can claim back the instance at any time. This can
be used on EMR and AWS Batch.
~80% cheaper
Market rates are about 80% cheaper than the
on-demand price, making it a very cost-effective way
of running big data pipelines
Stable, also for production use
By using Spot Fleets, you give AWS a range of
instances to run on. Claiming instances back
happens a few times per month, not every 20
minutes. Prices change gradually, not suddenly.
Positive experience
We’re using Spot instances extensively at our clients.
We observe that our data pipelines are cheap, stable,
and recover from failure when needed.
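The Spot-plus-fleet idea above can be expressed as an EMR instance-fleet configuration. The dictionary below follows the shape of the boto3 `run_job_flow` request, but the instance types and capacities are illustrative, not a recommendation.

```python
# Hypothetical EMR instance-fleet fragment: mix Spot with on-demand
# for stability. Field names follow the boto3 run_job_flow API;
# instance types and capacity numbers are illustrative.
core_fleet = {
    "Name": "core",
    "InstanceFleetType": "CORE",
    "TargetOnDemandCapacity": 1,   # keep one on-demand unit for stability
    "TargetSpotCapacity": 4,       # bulk of the capacity on Spot (~80% cheaper)
    "InstanceTypeConfigs": [       # give AWS a range of types to choose from
        {"InstanceType": t, "BidPriceAsPercentageOfOnDemandPrice": 100.0}
        for t in ["m5.xlarge", "m5a.xlarge", "m4.xlarge"]
    ],
}

spot_share = core_fleet["TargetSpotCapacity"] / (
    core_fleet["TargetSpotCapacity"] + core_fleet["TargetOnDemandCapacity"]
)
print(f"{spot_share:.0%}")  # 80%
```

Offering several interchangeable instance types is what makes reclaims rare: AWS can fill the fleet from whichever pool currently has spare capacity.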
15. Infrastructure view: all jobs run in private subnets or on AWS serverless infrastructure. We set up a separate STAGING environment which is a replica of the PROD environment.
(Diagram of your AWS account: a PROD VPC and a STAGING VPC that replicates it; S3 buckets for Data PROD, Data STAGING, Artifacts and Infrastructure; private subnets running AWS EMR, AWS Athena, AWS Batch, AWS Glue, Zeppelin, Airflow and Nifi in Auto Scaling groups; public subnets holding a Bastion server; plus AWS ECR and the Lighthouse UI.)
16. Infrastructure setup: fully automated setup using Terraform, bringing you live within one day, tailored to your specific needs. It also allows for easy creation of separate STAGING and PROD environments.
1. Base cloud infrastructure setup: creation of VPC, Subnets, Security Groups, IAM users, roles and policies, encryption keys, Bastion Server
2. Data lake setup: creation of S3 buckets, EMR cluster configuration, AWS Batch environment, Zeppelin Notebooks, AWS Athena, AWS ECR
3. Application setup: install Apache Airflow, Apache Nifi, Lighthouse UI
4. Client-specific setup: install client-specific components such as AWS RDS, Redshift, Elasticsearch, DynamoDB, …
17. Deployment way-of-working: all code is stored in git and deployments are automated using CircleCI. This allows for writing high-quality data pipelines in quick iterations.
1. Develop a data job using Scala / Python / Spark
2. Trigger a build on CircleCI
3. Run tests and build artifacts
4. Deploy artifacts to S3 and Docker containers to ECR
5. Build a data pipeline using the items created in step 1
6. Deploy the data pipeline to the STAGING environment
7. The data pipeline runs in STAGING
8. Deploy the data pipeline to the PROD environment
9. The data pipeline runs in PRODUCTION
Fully automated: the entire deployment pipeline is automated, so the engineer / scientist does not waste time in manual and error-prone release procedures.
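The key property of this flow is the gate between STAGING and PROD. The sketch below is purely illustrative (step names paraphrase the slide; there is no real CI system behind it) and only shows that a failed STAGING run blocks promotion to PROD.

```python
# Illustrative sketch of the CircleCI-driven deployment flow.
# Step names paraphrase the slide; the point is the STAGING gate.
STEPS = [
    "develop data job (Scala/Python/Spark)",
    "trigger build on CircleCI",
    "run tests, build artifacts",
    "deploy artifacts to S3",
    "deploy containers to ECR",
    "build data pipeline",
    "deploy pipeline to STAGING",
    "run in STAGING",
    "deploy pipeline to PROD",
    "run in PRODUCTION",
]

def deploy(staging_ok=True):
    """Walk the steps; stop before any PROD step if STAGING failed."""
    done = []
    for step in STEPS:
        if "PROD" in step and not staging_ok:
            break  # a failed STAGING run blocks promotion to PROD
        done.append(step)
    return done

print(len(deploy()))                   # 10 (all steps)
print(len(deploy(staging_ok=False)))   # 8  (stops before PROD)
```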