SlideShare uma empresa Scribd logo
1 de 29
S U M M I T
SYDNEY
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Building Serverless Analytics
Pipelines with AWS Glue
Tom McMeekin
Solutions Architect
Amazon Web Services
Drew Paterson
Solutions Architect
Amazon Web Services
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
There are more
people accessing data
And more
requirements for
making data available
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Data
Engineering
Data stewardship
Data
pipelines
Data
structures
Data lakes
Extract
Transform
Load
Data modelling
Data marts
Data warehouse
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
AWS Glue
Serverless data catalogue and ETL service
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
AWS Glue Crawlers
Amazon S3 Data Lake Storage
AWS Glue Data Catalogue
OLTP
ERP
CRM
LOB
Devices
Web
Sensors
Social
Automatically build your Data
Catalogue and keep it in sync
Built-in classifiers; custom
classifiers using Grok
expression
Run ad hoc or on a
schedule; serverless
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
AWS Glue Data Catalogue
Amazon Athena
Amazon Redshift
Amazon EMR
Amazon QuickSight
Amazon SageMaker
Amazon S3 Data Lake Storage
Search metadata for
data discovery
Single view across all
users, accounts, and
workloads
AWS Glue Data Catalogue
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Use AWS Glue to cleanse, prep, and move
Serverless Apache Spark or Python
environment
Auto-generate, write or bring your own
Python or Scala code
Amazon S3
(Raw data)
Amazon S3
(Staging
data)
Amazon S3
(Processed data)
AWS Glue Data Catalogue
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Apache Spark and AWS Glue ETL
AWS Glue builds on Apache Spark to offer ETL specific functionality
Apache Spark Core: RDDs
Apache Spark
DataFrames
AWS Glue
DynamicFrame
Apache SparkSQL AWS Glue ETL
Apache Spark is a distributed data processing engine for complex analytics
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
DataFrames
Core data structure for SparkSQL
Like structured tables
Need schema up-front
Each row has same structure
Suited for SQL-like analytics
DataFrames and DynamicFrames
DynamicFrames
Like DataFrames for ETL
Designed for processing semi-structured data,
e.g. JSON, Avro, Apache logs ...
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Developer Endpoints / Notebooks
Raw Dataset
Amazon SageMaker
Notebook
Optimised Dataset
Connect your IDE to an
AWS Glue development
endpoint
Environment to
interactively develop,
debug, and test ETL code
AWS Glue
Data Catalouge
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
• Specify the capacity that
gets allocated to each job
• Pay only for the resources
you consume
• Auto-configure VPC and
role-based access
• Connect to on-premises
JDBC data stores as source
There is no need to provision, configure, or manage servers
AWS Glue: Job Execution - Serverless
VPC
Amazon RDS
AWS Glue
Corporate data center
Database
AWS Direct Connect
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Three ways to orchestrate an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State machine–driven
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Schedule driven
Crawl
raw
dataset
Run
‘optimise’
job
Crawl
optimised
dataset
SLA
deadlineReady
for
reporting
Work backwards from a daily SLA deadline
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Event driven
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
Crawl
raw
dataset
Run
‘optimise’
job
Crawl
optimised
dataset
SLA
deadlineReady
for
reporting
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
State machine–driven
Let AWS Step Functions drive the pipeline
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Data
Engineering
DevOps
CI/CD
Canary
deployments
Feature flags
Chaos
engineering
Configuration
management
CI/CD for AWS Glue ETL
AWS CodePipeline
• Help Data Engineers write quality code
• Automate the ETL job release management process
• Mitigate risk
CI/CD for AWS Glue ETL
AWS CodePipeline
pipe_line_template.yaml
etl_job.py
live_test.py
AWS
CodeCommit
CI/CD for AWS Glue ETL
AWS
CloudFormation
Amazon S3
(Raw data)
Amazon S3
(Test data)
AWS CodePipeline
AWS
CodeCommit
pipe_line_template.yaml
etl_job.py
Role
CI/CD for AWS Glue ETL
Amazon S3
(Raw data)
Amazon S3
(Test data)
AWS Glue Data Catalogue
AWS
CodeBuild
AWS
CloudFormation
AWS
CodeCommit
live_test.py
CI/CD for AWS Glue ETL
Amazon Athena
AWS
CodeBuild
AWS
CloudFormation
AWS CodePipeline
Amazon S3
(Data Lake)
Amazon S3
(Test Data)
SELECT count(*) FROM ”sales".”data_lake”;
SELECT count(*) FROM ”sales_parquet".”test_data";
AWS
CodeCommit
✓
CI/CD for AWS Glue ETL
AWS
CodeCommit
AWS
CodeBuild
AWS
CloudFormation
AWS
CloudFormation
AWS CodePipeline
Amazon S3
(Raw data)
Amazon S3
(Prd data)
pipe_line_template.yaml
etl_job.py
Role
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
Go learn
• Remember the three steps to build a serverless data pipeline
• Use AWS Glue features
• Leverage the breadth of the AWS Platform
• Scan your badge to receive links to learning resources
Thank you!
S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Tom McMeekin Drew Paterson

Mais conteúdo relacionado

Mais procurados

Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)NTT DATA Technology & Innovation
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeDatabricks
 
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightIngestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightMicrosoft Tech Community
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Cathrine Wilhelmsen
 
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...StreamNative
 
【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた
【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた
【GOJAS Meetup-10】Splunk:SmartStoreを使ってみたtokita-r
 
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowSimplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowPyData
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKai Wähner
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera, Inc.
 
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...Amazon Web Services Japan
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Migrate to Microsoft Azure with Confidence
Migrate to Microsoft Azure with ConfidenceMigrate to Microsoft Azure with Confidence
Migrate to Microsoft Azure with ConfidenceDavid J Rosenthal
 

Mais procurados (20)

Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
 
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsightIngestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
Ingestion in data pipelines with Managed Kafka Clusters in Azure HDInsight
 
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
Security and Multi-Tenancy with Apache Pulsar in Yahoo! (Verizon Media) - Pul...
 
【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた
【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた
【GOJAS Meetup-10】Splunk:SmartStoreを使ってみた
 
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowSimplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
 
Kafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid CloudKafka for Real-Time Replication between Edge and Hybrid Cloud
Kafka for Real-Time Replication between Edge and Hybrid Cloud
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for Analytics
 
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...
202201 AWS Black Belt Online Seminar Apache Spark Performnace Tuning for AWS ...
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Migrate to Microsoft Azure with Confidence
Migrate to Microsoft Azure with ConfidenceMigrate to Microsoft Azure with Confidence
Migrate to Microsoft Azure with Confidence
 

Semelhante a Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019

Serverless data prep with AWS Glue - ADB306 - New York AWS Summit
Serverless data prep with AWS Glue - ADB306 - New York AWS SummitServerless data prep with AWS Glue - ADB306 - New York AWS Summit
Serverless data prep with AWS Glue - ADB306 - New York AWS SummitAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitPerforming serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitAmazon Web Services
 
Building-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWSBuilding-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWSAmazon Web Services
 
How to go from zero to data lakes in days - ADB202 - New York AWS Summit
How to go from zero to data lakes in days - ADB202 - New York AWS SummitHow to go from zero to data lakes in days - ADB202 - New York AWS Summit
How to go from zero to data lakes in days - ADB202 - New York AWS SummitAmazon Web Services
 
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019javier ramirez
 
Best Practices for Migrating Databases to the Cloud - AWS Summit Sydney
Best Practices for Migrating Databases to the Cloud - AWS Summit SydneyBest Practices for Migrating Databases to the Cloud - AWS Summit Sydney
Best Practices for Migrating Databases to the Cloud - AWS Summit SydneyAmazon Web Services
 
Building a Modern Data Platform on AWS
Building a Modern Data Platform on AWSBuilding a Modern Data Platform on AWS
Building a Modern Data Platform on AWSAmazon Web Services
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019Amazon Web Services
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Summits
 
Ask me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Ask me anything about building data lakes on AWS - ADB209 - New York AWS SummitAsk me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Ask me anything about building data lakes on AWS - ADB209 - New York AWS SummitAmazon Web Services
 
Building Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSBuilding Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSAmazon Web Services
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Amazon Web Services
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAdir Sharabi
 
Scale - Implementing a Data Warehouse on AWS
Scale - Implementing a Data Warehouse on AWSScale - Implementing a Data Warehouse on AWS
Scale - Implementing a Data Warehouse on AWSAmazon Web Services
 
Implementing a Data Warehouse on AWS in a Hybrid Environment
Implementing a Data Warehouse on AWS in a Hybrid EnvironmentImplementing a Data Warehouse on AWS in a Hybrid Environment
Implementing a Data Warehouse on AWS in a Hybrid EnvironmentAmazon Web Services
 

Semelhante a Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019 (20)

Serverless data prep with AWS Glue - ADB306 - New York AWS Summit
Serverless data prep with AWS Glue - ADB306 - New York AWS SummitServerless data prep with AWS Glue - ADB306 - New York AWS Summit
Serverless data prep with AWS Glue - ADB306 - New York AWS Summit
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS SummitPerforming serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
Performing serverless analytics in AWS Glue - ADB202 - Chicago AWS Summit
 
Building-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWSBuilding-Serverless-Analytics-On-AWS
Building-Serverless-Analytics-On-AWS
 
How to go from zero to data lakes in days - ADB202 - New York AWS Summit
How to go from zero to data lakes in days - ADB202 - New York AWS SummitHow to go from zero to data lakes in days - ADB202 - New York AWS Summit
How to go from zero to data lakes in days - ADB202 - New York AWS Summit
 
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
Building a Modern Data Platform on AWS. Public Sector Summit Brussels 2019
 
Best Practices for Migrating Databases to the Cloud - AWS Summit Sydney
Best Practices for Migrating Databases to the Cloud - AWS Summit SydneyBest Practices for Migrating Databases to the Cloud - AWS Summit Sydney
Best Practices for Migrating Databases to the Cloud - AWS Summit Sydney
 
Building a Modern Data Platform on AWS
Building a Modern Data Platform on AWSBuilding a Modern Data Platform on AWS
Building a Modern Data Platform on AWS
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
 
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
AWS Analytics Services - When to use what? | AWS Summit Tel Aviv 2019
 
Ask me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Ask me anything about building data lakes on AWS - ADB209 - New York AWS SummitAsk me anything about building data lakes on AWS - ADB209 - New York AWS Summit
Ask me anything about building data lakes on AWS - ADB209 - New York AWS Summit
 
Modern Data Platform on AWS
Modern Data Platform on AWSModern Data Platform on AWS
Modern Data Platform on AWS
 
Building Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWSBuilding Data Lakes for Analytics on AWS
Building Data Lakes for Analytics on AWS
 
Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28Building Data Lake on AWS | AWS Floor28
Building Data Lake on AWS | AWS Floor28
 
AWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWSAWS Floor 28 - Building Data lake on AWS
AWS Floor 28 - Building Data lake on AWS
 
Scale - Implementing a Data Warehouse on AWS
Scale - Implementing a Data Warehouse on AWSScale - Implementing a Data Warehouse on AWS
Scale - Implementing a Data Warehouse on AWS
 
Implementing a Data Warehouse on AWS in a Hybrid Environment
Implementing a Data Warehouse on AWS in a Hybrid EnvironmentImplementing a Data Warehouse on AWS in a Hybrid Environment
Implementing a Data Warehouse on AWS in a Hybrid Environment
 
Data_Analytics_and_AI_ML
Data_Analytics_and_AI_MLData_Analytics_and_AI_ML
Data_Analytics_and_AI_ML
 

Mais de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019

  • 1. S U M M I T SYDNEY
  • 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Building Serverless Analytics Pipelines with AWS Glue Tom McMeekin Solutions Architect Amazon Web Services Drew Paterson Solutions Architect Amazon Web Services
  • 3. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T There are more people accessing data And more requirements for making data available
  • 4. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Data Engineering Data stewardship Data pipelines Data structures Data lakes Extract Transform Load Data modelling Data marts Data warehouse
  • 5. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS Glue Serverless data catalogue and ETL service
  • 6. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 7. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS Glue Crawlers Amazon S3 Data Lake Storage AWS Glue Data Catalogue OLTP ERP CRM LOB Devices Web Sensors Social Automatically build your Data Catalogue and keep it in sync Built-in classifiers; custom classifiers using Grok expression Run ad hoc or on a schedule; serverless
  • 8. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T AWS Glue Data Catalogue Amazon Athena Amazon Redshift Amazon EMR Amazon QuickSight Amazon SageMaker Amazon S3 Data Lake Storage Search metadata for data discovery Single view across all users, accounts, and workloads AWS Glue Data Catalogue
  • 9. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 10. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Use AWS Glue to cleanse, prep, and move Serverless Apache Spark or Python environment Auto-generate, write or bring your own Python or Scala code Amazon S3 (Raw data) Amazon S3 (Staging data) Amazon S3 (Processed data) AWS Glue Data Catalogue
  • 11. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Apache Spark and AWS Glue ETL AWS Glue builds on Apache Spark to offer ETL specific functionality Apache Spark Core: RDDs Apache Spark DataFrames AWS Glue DynamicFrame Apache SparkSQL AWS Glue ETL Apache Spark is a distributed data processing engine for complex analytics
  • 12. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. DataFrames Core data structure for SparkSQL Like structured tables Need schema up-front Each row has same structure Suited for SQL-like analytics DataFrames and DynamicFrames DynamicFrames Like DataFrames for ETL Designed for processing semi-structured data, e.g. JSON, Avro, Apache logs ...
  • 13. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Developer Endpoints / Notebooks Raw Dataset Amazon SageMaker Notebook Optimised Dataset Connect your IDE to an AWS Glue development endpoint Environment to interactively develop, debug, and test ETL code AWS Glue Data Catalouge
  • 14. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 15. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T • Specify the capacity that gets allocated to each job • Pay only for the resources you consume • Auto-configure VPC and role-based access • Connect to on-premises JDBC data stores as source There is no need to provision, configure, or manage servers AWS Glue: Job Execution - Serverless VPC Amazon RDS AWS Glue Corporate data center Database AWS Direct Connect
  • 16. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Three ways to orchestrate an AWS Glue ETL pipeline • Schedule-driven • Event-driven • State machine–driven
  • 17. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Schedule driven Crawl raw dataset Run ‘optimise’ job Crawl optimised dataset SLA deadlineReady for reporting Work backwards from a daily SLA deadline
  • 18. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Event driven Let Amazon CloudWatch Events and AWS Lambda drive the pipeline Crawl raw dataset Run ‘optimise’ job Crawl optimised dataset SLA deadlineReady for reporting
  • 19. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T State machine–driven Let AWS Step Functions drive the pipeline
  • 20. S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 21. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Data Engineering DevOps CI/CD Canary deployments Feature flags Chaos engineering Configuration management
  • 22. CI/CD for AWS Glue ETL AWS CodePipeline • Help Data Engineers write quality code • Automate the ETL job release management process • Mitigate risk
  • 23. CI/CD for AWS Glue ETL AWS CodePipeline pipe_line_template.yaml etl_job.py live_test.py AWS CodeCommit
  • 24. CI/CD for AWS Glue ETL AWS CloudFormation Amazon S3 (Raw data) Amazon S3 (Test data) AWS CodePipeline AWS CodeCommit pipe_line_template.yaml etl_job.py Role
  • 25. CI/CD for AWS Glue ETL Amazon S3 (Raw data) Amazon S3 (Test data) AWS Glue Data Catalogue AWS CodeBuild AWS CloudFormation AWS CodeCommit live_test.py
  • 26. CI/CD for AWS Glue ETL Amazon Athena AWS CodeBuild AWS CloudFormation AWS CodePipeline Amazon S3 (Data Lake) Amazon S3 (Test Data) SELECT count(*) FROM ”sales".”data_lake”; SELECT count(*) FROM ”sales_parquet".”test_data"; AWS CodeCommit ✓
  • 27. CI/CD for AWS Glue ETL AWS CodeCommit AWS CodeBuild AWS CloudFormation AWS CloudFormation AWS CodePipeline Amazon S3 (Raw data) Amazon S3 (Prd data) pipe_line_template.yaml etl_job.py Role
  • 28. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T Go learn • Remember the three steps to build a serverless data pipeline • Use AWS Glue features • Leverage the breadth of the AWS Platform • Scan your badge to receive links to learning resources
  • 29. Thank you! S U M M I T © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved. Tom McMeekin Drew Paterson