Data Analytics Master Class
AI specialist
Machine learning expert
Analytics expert
ETL engineer
Data architect
Devops engineer
https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
Data engineer
Data scientist
“Data natives”
Understand entire pyramid
Specialise in a few layers
Iteration 1
Simple ML algorithm in Python
20 features
Only happy path
Manual runs, no testing
Sample data
Iteration 2
Cross validation, large-scale predictions in Spark
40 features
Cleaning of all sources
Automated runs in the cloud with testing
Internal, structured data sources
Iteration 3
External review, improved accuracy
80 features
Basic business logic
Monitoring, workflow optimisation
External, unstructured data sources
Iteration 4
Optimised model, real-time predictions
160 features
Advanced data integration
Fault-tolerant, auto-scaling
Real-time data ingest and real-time results
1. Think about the entire pyramid
2. Deliver value in iterations
3. Do a vertical slice at each iteration
Implementation for Juyo
Iteration 1
Sample data
Python on my laptop
Simple booking logic
Features hacked together
Random Forest on my laptop
Not there yet :-)
Iteration 2
Booking data and event data delivered on an FTP server
Lighthouse managed service
Complex booking logic, integrate event data
Feature engine
Random Forest in a Docker container
Not there yet :-)
Iteration 3
?
Components of Lighthouse managed service
AWS S3
Data lake storage. Stores the raw
incoming data in any format.
Processed data uses a structured,
compressed format such as ORC.
Apache Airflow
Orchestration and
scheduling of jobs. Track
success and duration of
jobs. Recover from failure.
Apache Nifi
Ingestion of any kind of
data
AWS EMR
Cluster processing: run Spark
jobs at scale. Run several
clusters in parallel for
mission-critical workloads.
AWS Athena
SQL queries directly on the
data in the data lake.
AWS Batch
Single-node processing: run any
workload in a Docker container.
Many jobs can run in parallel.
AWS Glue Data Catalog
Metastore that describes
all data in the data lake.
Available in EMR, Batch,
and Athena.
Apache Zeppelin
Notebook server to do data
science experiments on a
cluster in the cloud.
Data sources
Databases, CSV files, IoT
sensors, cell phones,
factories, www, ...
Consumption layer
Processing layer
Storage layer
Lighthouse UI
Visualise datasets, applications
and lineage on the web
Data export
Load data in downstream
systems such as AWS RDS,
Redshift, DynamoDB,
ElasticSearch, a REST API, ...
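The orchestration component above is described as tracking the success and duration of jobs and recovering from failure. As a rough illustration of that idea only, here is a minimal plain-Python sketch. It does not use the Apache Airflow API; `JobRun`, `run_with_retries`, and `flaky_ingest` are hypothetical names chosen for this example.

```python
import time
from dataclasses import dataclass


@dataclass
class JobRun:
    """Outcome of one job, roughly what an orchestrator would record."""
    name: str
    succeeded: bool
    attempts: int
    duration_s: float


def run_with_retries(name, job, max_attempts=3, wait_s=0.0):
    """Run `job` (a zero-argument callable) until it succeeds or attempts run out."""
    start = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            job()
            return JobRun(name, True, attempt, time.monotonic() - start)
        except Exception:
            # A real orchestrator would also log the error; here we only wait and retry.
            if attempt < max_attempts:
                time.sleep(wait_s)
    return JobRun(name, False, max_attempts, time.monotonic() - start)


# Hypothetical job that fails once before succeeding.
calls = {"count": 0}


def flaky_ingest():
    calls["count"] += 1
    if calls["count"] < 2:
        raise RuntimeError("transient failure")


result = run_with_retries("ingest", flaky_ingest)
print(result.name, result.succeeded, result.attempts)
```

In a real deployment the retry count, delay, and run history would be configured in the orchestrator rather than hand-written like this.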
Example workflow: Clean the data and build features in Spark. Build and score the model in Python.
Cleansing
Raw data is converted to
ORC, invalid rows are
removed, and the cleaned
result is stored in S3
EMR cluster
Master records
An integrated view of
different data sources is
built as a single version
of the truth
Model training
A predictive model is built
using the features from
the previous step as input
Feature building
Metrics are calculated
and features are built
using historical data
Model scoring
Metrics are calculated
and features are built
using historical data
AWS Batch
S3
/RAW
S3
/CLEAN
S3
/MASTER
S3
/METRICS
S3
/MODELS
RDS
PREDICTIONS
Code can be written in any language, as long as it runs in a docker container or on an EMR cluster.
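The example workflow can be read as a small dependency graph in which each stage reads one or more locations and writes one. The sketch below derives an execution order from that graph. The mapping of stages to the /RAW, /CLEAN, /MASTER, /METRICS, /MODELS, and PREDICTIONS locations is an assumption based on the order the locations appear on the slide, and the stage names and the `execution_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Stage -> locations it reads and the location it writes.
# Assumed mapping; the slide lists the locations but not the exact wiring.
STAGES = {
    "cleansing": {"reads": ["S3 /RAW"], "writes": "S3 /CLEAN"},
    "master_records": {"reads": ["S3 /CLEAN"], "writes": "S3 /MASTER"},
    "feature_building": {"reads": ["S3 /MASTER"], "writes": "S3 /METRICS"},
    "model_training": {"reads": ["S3 /METRICS"], "writes": "S3 /MODELS"},
    "model_scoring": {
        "reads": ["S3 /METRICS", "S3 /MODELS"],
        "writes": "RDS PREDICTIONS",
    },
}


def execution_order(stages):
    """Order stages so that each one runs after the stages producing its inputs."""
    producer = {spec["writes"]: name for name, spec in stages.items()}
    graph = {
        name: {producer[loc] for loc in spec["reads"] if loc in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(STAGES)
print(order)
```

Declaring inputs and outputs per stage, rather than hard-coding the order, keeps the order correct when a stage is added or rewired.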
Predictive model - approach
Features to predict for next day
Features to predict for next week
Features to predict for next year
27 Random Forest models
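One common reason for keeping separate feature sets per prediction horizon is that a feature may only use values known at the time the prediction is made. The slides do not say how the features were built, so the sketch below is an assumption-laden illustration of that constraint. `lag_features` and the `daily_rooms` series are hypothetical.

```python
def lag_features(series, target_index, horizon, n_lags=3):
    """Return the n_lags most recent values known `horizon` steps before the target."""
    last_known = target_index - horizon
    first = last_known - n_lags + 1
    if first < 0:
        raise ValueError("not enough history for this horizon")
    return series[first:last_known + 1]


# Hypothetical daily series of booked rooms; the target is the last day (index 9).
daily_rooms = [10, 12, 11, 15, 14, 13, 16, 18, 17, 19]

next_day = lag_features(daily_rooms, target_index=9, horizon=1)
next_week = lag_features(daily_rooms, target_index=9, horizon=7)
print(next_day, next_week)
```

The longer the horizon, the older the most recent usable value, which is one reason models for different horizons tend to be trained separately.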
Performance for XXX (revenue details)
7-days prediction - segment ABC
AVG ERROR = 10,00%
Performance for YYY (rooms details)
7-days prediction - segment ABC
AVG ERROR = 418,67%
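The slides do not define "AVG ERROR". Assuming it is a mean absolute percentage error, the sketch below shows how the metric is computed and why it can exceed 100% when the actual values are small, as room counts per segment often are. All numbers are invented for illustration and are not the results reported above.

```python
def mean_absolute_percentage_error(actual, predicted):
    """Average of |actual - predicted| / |actual|, in percent, skipping zero actuals."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actual values")
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


# Hypothetical revenue figures: large actuals, moderate misses.
revenue_error = mean_absolute_percentage_error(
    [1000.0, 1200.0, 900.0], [1100.0, 1080.0, 990.0]
)

# Hypothetical room counts: small actuals, so a miss of a few rooms is a huge percentage.
rooms_error = mean_absolute_percentage_error([2.0, 1.0, 4.0], [10.0, 6.0, 8.0])

print(f"{revenue_error:.2f}% {rooms_error:.2f}%")
```

If the metric is indeed percentage-based, an absolute error measure such as the mean absolute error in rooms may be easier to interpret for low-volume segments.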
backup slides
Both EMR and AWS Batch allow for very cost-effective runs using Spot instances, resulting in 80% cost
savings without compromising stability.
Spot instances == using free capacity of AWS
AWS rents out its unused capacity at market rates,
but can claim back the instance at any time. This can
be used on EMR and AWS Batch.
~80% cheaper
Market rates are about 80% cheaper than the
on-demand price, making it a very cost-effective way
of running big data pipelines
Stable, also for production use
By using Spot Fleets, you give AWS a range of
instances to run on. Claiming instances back
happens a few times per month, not every 20
minutes. Prices change gradually, not suddenly.
Positive experience
We’re using Spot instances extensively at our clients.
We observe that our data pipelines are cheap, stable,
and recover from failure when needed.
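To make the ~80% figure concrete, the arithmetic can be sketched as follows. The cluster size, run length, and hourly price are hypothetical, and the 80% discount is the approximate figure quoted on the slide, not a guaranteed rate.

```python
def run_cost(hours, nodes, on_demand_price, spot_discount=0.0):
    """Cost of one pipeline run; spot_discount is the fraction saved versus on-demand."""
    return hours * nodes * on_demand_price * (1.0 - spot_discount)


# Hypothetical nightly run: 10 nodes for 3 hours at 0.40 per node-hour on-demand.
on_demand = run_cost(hours=3, nodes=10, on_demand_price=0.40)
spot = run_cost(hours=3, nodes=10, on_demand_price=0.40, spot_discount=0.80)

print(f"on-demand: {on_demand:.2f}, spot: {spot:.2f}")
```

This ignores the cost of re-running work after an instance is reclaimed, which the slide describes as happening only a few times per month.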
Your AWS Account
PROD VPC STAGING VPC
Infrastructure view: all jobs run in private subnets or on AWS serverless infrastructure.
We set up a separate STAGING environment which is a replica of the PROD environment
AWS S3
Data PROD
AWS S3
Data STAGING
AWS S3
Artifacts
AWS S3
Infrastructure
Private Subnets
Replica of
PROD VPC
AWS EMR
AWS Athena
AWS Batch
AWS Glue
Zeppelin
Airflow
Nifi
Auto Scaling group
Auto Scaling group
Public subnets
Bastion server
AWS ECR
Lighthouse UI
Infrastructure setup: Fully automated setup using Terraform, bringing you live within one day, tailored
to your specific needs. It also allows for easy creation of separate STAGING and PROD environments
1. Base cloud infrastructure setup
Creation of VPC, Subnets, Security Groups, IAM users,
roles and policies, encryption keys, Bastion Server
2. Data lake setup
Creation of S3 buckets, EMR cluster configuration,
AWS Batch environment, Zeppelin Notebooks,
AWS Athena, AWS ECR
3. Application setup
Install Apache Airflow, Apache Nifi, Lighthouse UI
4. Client-specific setup
Install client-specific components such as AWS RDS,
Redshift, Elasticsearch, DynamoDB, …
Deployment way of working: all code is stored in Git and deployments are automated using CircleCI.
This allows for writing high-quality data pipelines in quick iterations.
1. Develop a data job using Scala / Python / Spark
Trigger build on CircleCI
Run tests and build artifacts
Deploy artifacts to S3
Deploy Docker containers to ECR
2. Build a data pipeline using the items created in step 1
Deploy data pipeline to STAGING environment
Data pipeline runs in STAGING
Deploy data pipeline to PROD environment
Data pipeline runs in PRODUCTION
Fully automated
The entire deployment pipeline is automated, so the engineer / scientist
does not waste time in manual and error-prone release procedures.
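The STAGING-then-PROD sequence above can be sketched as a simple promotion gate. The slides do not state that a failed STAGING run blocks the PROD deployment, so that rule is an assumption here. `promote` and the `run_in` callback are hypothetical and stand in for whatever the CI system actually invokes.

```python
def promote(pipeline, run_in):
    """Deploy to STAGING first; continue to PROD only if the STAGING run succeeds."""
    log = []
    for env in ("STAGING", "PROD"):
        log.append(f"deploy {pipeline} to {env}")
        ok = run_in(env, pipeline)
        log.append(f"run in {env}: {'ok' if ok else 'failed'}")
        if not ok:
            break  # Assumed rule: never promote past a failed environment.
    return log


# Hypothetical runners: one where every run succeeds, one where STAGING fails.
healthy = promote("revenue-forecast", lambda env, name: True)
broken = promote("revenue-forecast", lambda env, name: env != "STAGING")

print(healthy)
print(broken)
```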

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
6 Tips for Interpretable Topic Models _ by Nicha Ruchirawat _ Towards Data Sc...
 
Cyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded dataCyber awareness ppt on the recorded data
Cyber awareness ppt on the recorded data
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 

Data analytics master class: predict hotel revenue

  • 2. AI specialist, Machine learning expert, Analytics expert, ETL engineer, Data architect, DevOps engineer. https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
  • 4. “Data natives”: understand the entire pyramid, specialise in a few layers. https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
  • 5. Iteration 1: Simple ML algorithm in Python; 20 features; only happy path; manual runs, no testing; sample data.
       Iteration 2: Cross validation, large scale predictions in Spark; 40 features; cleaning of all sources; automated runs in the cloud with testing; internal, structured data sources.
       Iteration 3: External review, improved accuracy; 80 features; basic business logic; monitoring, workflow optimisation; external, unstructured data sources.
       Iteration 4: Optimised model, real-time predictions; 160 features; advanced data integration; fault-tolerant, auto-scaling; real-time data ingest and real-time results.
  • 6. 1. Think about the entire pyramid
       2. Deliver value in iterations
       3. Do a vertical slice at each iteration
       https://hackernoon.com/the-ai-hierarchy-of-needs-18f111fcc007
  • 7. Implementation for Juyo
       Iteration 1: Sample data; Python on my laptop; simple booking logic; features hacked together; Random Forest on my laptop; not there yet :-)
       Iteration 2: Booking data and event data, delivered on an FTP server; Lighthouse managed service; complex booking logic, integrate event data; feature engine; Random Forest in a Docker container; not there yet :-)
       Iteration 3: ?
  • 8. Components of Lighthouse managed service
       AWS S3: Data lake storage. Stores the RAW incoming data in any format. The processed data uses a structured, compressed format such as ORC.
       Apache Airflow: Orchestration and scheduling of jobs. Track success and duration of jobs. Recover from failure.
       Apache NiFi: Ingestion of any kind of data.
       AWS EMR: Cluster processing: run Spark jobs at scale. Run several clusters in parallel for mission-critical workloads.
       AWS Athena: SQL queries directly on the data in the data lake.
       AWS Batch: Single-node processing: run any workload in a Docker container. Many jobs can run in parallel.
       AWS Glue Data Catalog: Metastore that describes all data in the data lake. Available in EMR, Batch and Athena.
       Apache Zeppelin: Notebook server to do data science experiments on a cluster in the cloud.
       Data sources: Databases, CSV files, IoT sensors, cell phones, factories, www, ...
       Lighthouse UI: Visualise datasets, applications and lineage on the web.
       Data export: Load data in downstream systems such as AWS RDS, Redshift, DynamoDB, Elasticsearch, a REST API, ...
       Layers: Consumption layer, Processing layer, Storage layer.
  • 9. Example workflow: Clean the data and build features in Spark. Build and score the model in Python.
       Cleansing: Raw data is converted to ORC, invalid rows are removed, and the cleaned result is stored in S3.
       Master records: An integrated view of the different data sources is built as a single version of the truth.
       Feature building: Metrics are calculated and features are built using historical data.
       Model training: A predictive model is built using the features from the previous step as input.
       Model scoring: The model is scored to produce predictions, which are stored in RDS.
       Runs on: EMR cluster, AWS Batch.
       Data locations: S3 /RAW, S3 /CLEAN, S3 /MASTER, S3 /METRICS, S3 /MODELS, RDS PREDICTIONS.
       Code can be written in any language, as long as it runs in a Docker container or on an EMR cluster.
  • 10. Predictive model - approach: features to predict for next day; features to predict for next week; features to predict for next year; 27 Random Forest models.
  • 11. Performance for XXX (revenue details): 7-day prediction, segment ABC. AVG ERROR = 10,00%
  • 12. Performance for YYY (rooms details): 7-day prediction, segment ABC. AVG ERROR = 418,67%
  • 14. Both EMR and AWS Batch allow for very cost-effective runs using Spot instances, resulting in 80% cost savings without compromising stability.
       Spot instances == using free capacity of AWS: AWS rents out its unused capacity at market rates, but can claim back the instance at any time. This can be used on EMR and AWS Batch.
       ~80% cheaper: Market rates are about 80% cheaper than the on-demand price, making it a very cost-effective way of running big data pipelines.
       Stable, also for production use: By using Spot Fleets, you give AWS a range of instances to run on. Claiming instances back happens a few times per month, not every 20 minutes. Prices change gradually, not suddenly.
       Positive experience: We’re using Spot instances extensively at our clients. We observe that our data pipelines are cheap, stable, and recover from failure when needed.
  • 15. Infrastructure view: all jobs run in private subnets or on AWS serverless infrastructure. We set up a separate STAGING environment which is a replica of the PROD environment.
       Diagram labels: Your AWS Account; PROD VPC; STAGING VPC (replica of PROD VPC); AWS S3 Data PROD; AWS S3 Data STAGING; AWS S3 Artifacts; AWS S3 Infrastructure; Private subnets; AWS EMR; AWS Athena; AWS Batch; AWS Glue; Zeppelin; Airflow; NiFi; Auto Scaling group (x2); Public subnets; Bastion server; AWS ECR; Lighthouse UI.
  • 16. Infrastructure setup: Fully automated setup using Terraform, bringing you live within one day, tailored to your specific needs. It also allows for easy creation of separate STAGING and PROD environments.
       1. Base cloud infrastructure setup: Creation of VPC, subnets, security groups, IAM users, roles and policies, encryption keys, bastion server.
       2. Data lake setup: Creation of S3 buckets, EMR cluster configuration, AWS Batch environment, Zeppelin notebooks, AWS Athena, AWS ECR.
       3. Application setup: Install Apache Airflow, Apache NiFi, Lighthouse UI.
       4. Client-specific setup: Install client-specific components such as AWS RDS, Redshift, Elasticsearch, DynamoDB, …
  • 17. Deployment way-of-working: all code is stored in Git and deployments are automated using CircleCI. This allows for writing high-quality data pipelines in quick iterations.
       1. Develop a data job using Scala / Python / Spark. Trigger build on CircleCI. Run tests and build artifacts. Deploy artifacts to S3. Deploy Docker containers to ECR.
       2. Build a data pipeline using the items created in step 1. Deploy data pipeline to STAGING environment. Data pipeline runs in STAGING. Deploy data pipeline to PROD environment. Data pipeline runs in PRODUCTION.
       Fully automated: The entire deployment pipeline is automated, so the engineer / scientist does not waste time in manual and error-prone release procedures.
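
The example workflow on slide 9 can be read as a chain of stages, each consuming the output of an earlier one. The following is a minimal sketch of that idea in plain Python, not the actual Lighthouse or Airflow code. The names `Stage`, `STAGES` and `execution_order` are hypothetical, the location labels such as `s3:/RAW` are illustrative, and the assignment of stages to locations and engines is an assumption inferred from the order in which they appear on the slide.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Stage:
    name: str
    engine: str      # "EMR" (Spark) or "Batch" (Python in a Docker container)
    inputs: tuple    # locations this stage reads from
    output: str      # location this stage writes to


# Assumed mapping of slide 9 stages to data locations (not confirmed by the slides).
STAGES = [
    Stage("cleansing", "EMR", ("s3:/RAW",), "s3:/CLEAN"),
    Stage("master_records", "EMR", ("s3:/CLEAN",), "s3:/MASTER"),
    Stage("feature_building", "EMR", ("s3:/MASTER",), "s3:/METRICS"),
    Stage("model_training", "Batch", ("s3:/METRICS",), "s3:/MODELS"),
    Stage("model_scoring", "Batch", ("s3:/METRICS", "s3:/MODELS"), "rds:PREDICTIONS"),
]


def execution_order(stages):
    """Return stage names in an order where every input is produced before it is read."""
    producer = {s.output: s.name for s in stages}
    # Map each stage to the set of stages that produce its inputs.
    # Inputs without a producer (here: the RAW data) are external sources.
    graph = {
        s.name: {producer[i] for i in s.inputs if i in producer}
        for s in stages
    }
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    engines = {s.name: s.engine for s in STAGES}
    for name in execution_order(STAGES):
        print(f"{name} (runs on {engines[name]})")
```

Deriving the order from the declared inputs and outputs, rather than hard-coding it, is the same principle an orchestrator such as Airflow applies when it schedules dependent jobs.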