Building Serverless Analytics Pipelines with AWS Glue - AWS Summit Sydney 2019

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building Serverless Analytics
Pipelines with AWS Glue
Tom McMeekin
Solutions Architect
Amazon Web Services
Drew Paterson
Solutions Architect
Amazon Web Services

There are more people accessing data
And more requirements for making data available

Data Engineering
Data stewardship
Data pipelines
Data structures
Data lakes
Extract
Transform
Load
Data modelling
Data marts
Data warehouse

AWS Glue
Serverless data catalogue and ETL service

AWS Glue Crawlers
Amazon S3 Data Lake Storage
AWS Glue Data Catalogue
OLTP
ERP
CRM
LOB
Devices
Web
Sensors
Social
Automatically build your Data Catalogue and keep it in sync
Built-in classifiers; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless
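A crawler with a custom Grok classifier and a schedule could be defined roughly as follows. This is a minimal sketch, not the presenters' code: the crawler name, role ARN, bucket path, database name and Grok pattern are all hypothetical, and the builders only assemble the keyword arguments that boto3's `create_classifier` and `create_crawler` calls accept. boto3 is imported lazily, so the request builders can be tried without AWS credentials.

```python
def build_grok_classifier_request(name, classification, grok_pattern):
    """Keyword arguments for glue.create_classifier (Grok variant)."""
    return {
        "GrokClassifier": {
            "Name": name,
            "Classification": classification,
            "GrokPattern": grok_pattern,
        }
    }


def build_crawler_request(name, role_arn, database, s3_path,
                          classifiers=(), schedule=None):
    """Keyword arguments for glue.create_crawler.

    Leave `schedule` as None for a crawler that is only run ad hoc.
    """
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Classifiers": list(classifiers),
    }
    if schedule is not None:
        request["Schedule"] = schedule
    return request


def create_crawler(classifier_request, crawler_request):
    """Send both requests to AWS Glue (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builders work without boto3

    glue = boto3.client("glue")
    glue.create_classifier(**classifier_request)
    glue.create_crawler(**crawler_request)


# Hypothetical names and pattern, for illustration only.
classifier_request = build_grok_classifier_request(
    name="app-log-classifier",
    classification="app_logs",
    grok_pattern="%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
)
crawler_request = build_crawler_request(
    name="raw-crawler",
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    database="sales",
    s3_path="s3://example-bucket/raw/",
    classifiers=["app-log-classifier"],
    schedule="cron(0 2 * * ? *)",  # every day at 02:00 UTC
)
```

Calling `create_crawler(classifier_request, crawler_request)` would register both resources; an ad hoc run would then use the Glue `start_crawler` call.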

Amazon Athena
Amazon Redshift
Amazon EMR
Amazon QuickSight
Amazon SageMaker
Amazon S3 Data Lake Storage
Search metadata for data discovery
Single view across all users, accounts, and workloads

Use AWS Glue to cleanse, prep, and move data
Serverless Apache Spark or Python environment
Auto-generate, write or bring your own Python or Scala code
Amazon S3 (Raw data)
Amazon S3 (Staging data)
Amazon S3 (Processed data)

Apache Spark and AWS Glue ETL
AWS Glue builds on Apache Spark to offer ETL specific functionality
Apache Spark Core: RDDs
Apache Spark DataFrames
AWS Glue DynamicFrame
Apache SparkSQL
AWS Glue ETL
Apache Spark is a distributed data processing engine for complex analytics

DataFrames and DynamicFrames
DataFrames
• Core data structure for SparkSQL
• Like structured tables
• Need schema up-front
• Each row has same structure
• Suited for SQL-like analytics
DynamicFrames
• Like DataFrames for ETL
• Designed for processing semi-structured data, e.g. JSON, Avro, Apache logs ...
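The difference between the two structures can be illustrated without Spark. The plain-Python sketch below is a conceptual illustration only and is not the AWS Glue API: it infers a schema from the records themselves instead of requiring one up-front, and flags fields whose type varies between records. In Glue, a DynamicFrame represents such a field as a choice type that can later be resolved. The function names and sample records are made up.

```python
def infer_schema(records):
    """Collect the set of Python type names seen for every field."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def choice_fields(schema):
    """Fields observed with more than one type across the records."""
    return sorted(field for field, types in schema.items() if len(types) > 1)


def resolve_choice(records, field, cast):
    """Return new records with `field` cast to one type where present."""
    return [
        {**record, field: cast(record[field])} if field in record else dict(record)
        for record in records
    ]


# Semi-structured input: optional fields and an inconsistently typed id.
records = [
    {"id": 1, "price": 9.5},
    {"id": "2", "price": 12.0, "coupon": "SUMMIT"},
    {"id": 3},
]

schema = infer_schema(records)
conflicts = choice_fields(schema)
cleaned = resolve_choice(records, "id", int)
```

A fixed-schema table would reject or null out the second record; here the conflict on `id` is recorded first and resolved explicitly afterwards.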

Developer Endpoints / Notebooks
Raw Dataset
Amazon SageMaker Notebook
Optimised Dataset
Connect your IDE to an AWS Glue development endpoint
Environment to interactively develop, debug, and test ETL code
AWS Glue Data Catalogue

AWS Glue: Job Execution - Serverless
There is no need to provision, configure, or manage servers
• Specify the capacity that gets allocated to each job
• Pay only for the resources you consume
• Auto-configure VPC and role-based access
• Connect to on-premises JDBC data stores as source
VPC
Amazon RDS
AWS Glue
Corporate data center
Database
AWS Direct Connect
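As a rough sketch of how per-job capacity and a JDBC connection might be specified, the helper below assembles the keyword arguments for boto3's Glue `create_job` call. The job name, role ARN, script location and connection name are hypothetical, and `MaxCapacity` is expressed in data processing units (DPUs).

```python
def build_job_request(name, role_arn, script_location, max_capacity,
                      connections=()):
    """Keyword arguments for glue.create_job for a Spark ETL job."""
    request = {
        "Name": name,
        "Role": role_arn,
        # "glueetl" selects the Apache Spark ETL job type.
        "Command": {"Name": "glueetl", "ScriptLocation": script_location},
        # Capacity allocated to each run of this job, in DPUs.
        "MaxCapacity": float(max_capacity),
    }
    if connections:
        # Named Glue connections, e.g. to an on-premises JDBC store.
        request["Connections"] = {"Connections": list(connections)}
    return request


def create_job(job_request):
    """Register the job with AWS Glue (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    return boto3.client("glue").create_job(**job_request)


# Hypothetical values, for illustration only.
job_request = build_job_request(
    name="optimise-job",
    role_arn="arn:aws:iam::123456789012:role/GlueJobRole",
    script_location="s3://example-bucket/scripts/etl_job.py",
    max_capacity=10,
    connections=["onprem-jdbc"],
)
```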

Three ways to orchestrate an AWS Glue ETL pipeline
• Schedule-driven
• Event-driven
• State machine–driven

Schedule-driven
Crawl raw dataset
Run ‘optimise’ job
Crawl optimised dataset
SLA deadline
Ready for reporting
Work backwards from a daily SLA deadline
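Working backwards from the deadline is simple date arithmetic, sketched here. The deadline and the step durations are invented for illustration; in practice they would come from observed run times plus a safety margin.

```python
from datetime import datetime, timedelta


def schedule_backwards(deadline, steps):
    """Return (step name, latest start time) pairs in execution order.

    `steps` is a list of (name, expected duration) in execution order.
    Each step must finish before the next one starts, and the last
    step must finish by `deadline`.
    """
    plan = []
    cursor = deadline
    for name, duration in reversed(steps):
        cursor = cursor - duration  # latest moment this step can start
        plan.append((name, cursor))
    plan.reverse()
    return plan


# Hypothetical daily SLA: data ready for reporting by 06:00.
deadline = datetime(2019, 5, 1, 6, 0)
steps = [
    ("crawl raw dataset", timedelta(minutes=15)),
    ("run optimise job", timedelta(minutes=60)),
    ("crawl optimised dataset", timedelta(minutes=15)),
]
plan = schedule_backwards(deadline, steps)
for name, start in plan:
    print(f"{start:%H:%M}  {name}")
```

With these assumed durations, the raw crawl has to start no later than 04:30. Each start time then becomes the schedule for the corresponding crawler or job trigger.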

Event-driven
Let Amazon CloudWatch Events and AWS Lambda drive the pipeline
Crawl raw dataset
Run ‘optimise’ job
Crawl optimised dataset
SLA deadline
Ready for reporting
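An event-driven version of the same pipeline might use a single AWS Lambda function that reacts to Glue state-change events and starts the next step. This is a sketch under assumptions: the crawler and job names are hypothetical, and the event fields (`detail-type`, `crawlerName`, `jobName`, `state`) reflect my understanding of the events Glue publishes to CloudWatch Events, so they should be checked against the current documentation. The `glue` parameter exists only so the handler can be exercised with a stub.

```python
# Hypothetical resource names, for illustration only.
RAW_CRAWLER = "raw-crawler"
OPTIMISE_JOB = "optimise-job"
OPTIMISED_CRAWLER = "optimised-crawler"


def next_action(event):
    """Map a Glue state-change event to the next pipeline step, if any."""
    detail_type = event.get("detail-type")
    detail = event.get("detail", {})
    if (detail_type == "Glue Crawler State Change"
            and detail.get("crawlerName") == RAW_CRAWLER
            and detail.get("state") == "Succeeded"):
        return ("start_job_run", OPTIMISE_JOB)
    if (detail_type == "Glue Job State Change"
            and detail.get("jobName") == OPTIMISE_JOB
            and detail.get("state") == "SUCCEEDED"):
        return ("start_crawler", OPTIMISED_CRAWLER)
    return None  # unrelated event, or the pipeline has finished


def lambda_handler(event, context, glue=None):
    """Lambda entry point: start whichever step the event calls for."""
    action = next_action(event)
    if action is None:
        return {"action": None}
    if glue is None:
        import boto3  # available in the Lambda Python runtime

        glue = boto3.client("glue")
    operation, target = action
    if operation == "start_job_run":
        glue.start_job_run(JobName=target)
    else:
        glue.start_crawler(Name=target)
    return {"action": operation, "target": target}
```

A CloudWatch Events rule matching events from the `aws.glue` source would invoke this function, so each step begins as soon as the previous one succeeds instead of waiting for a fixed time slot.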

State machine–driven
Let AWS Step Functions drive the pipeline
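A state machine for the crawl, optimise, crawl sequence could look roughly like the definition below, written as a Python dictionary and serialised to Amazon States Language JSON. It is a sketch under assumptions: the state names, the job name and the Lambda function that runs a crawler and waits for it are all hypothetical. The Glue job step uses the Step Functions service integration `glue:startJobRun.sync`, which waits for the job run to finish.

```python
import json


def build_state_machine(crawl_function_arn, job_name):
    """Amazon States Language definition for a three-step Glue pipeline."""
    return {
        "Comment": "Crawl raw data, run the optimise job, crawl the result",
        "StartAt": "CrawlRawDataset",
        "States": {
            "CrawlRawDataset": {
                "Type": "Task",
                # Hypothetical Lambda that starts a crawler and waits for it.
                "Resource": crawl_function_arn,
                "Parameters": {"crawler": "raw-crawler"},
                "Next": "RunOptimiseJob",
            },
            "RunOptimiseJob": {
                "Type": "Task",
                # Service integration: start the Glue job and wait for it.
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 60,
                    "MaxAttempts": 2,
                    "BackoffRate": 2.0,
                }],
                "Next": "CrawlOptimisedDataset",
            },
            "CrawlOptimisedDataset": {
                "Type": "Task",
                "Resource": crawl_function_arn,
                "Parameters": {"crawler": "optimised-crawler"},
                "End": True,
            },
        },
    }


definition = build_state_machine(
    crawl_function_arn="arn:aws:lambda:ap-southeast-2:123456789012:function:run-crawler",
    job_name="optimise-job",
)
definition_json = json.dumps(definition, indent=2)
```

Keeping ordering, retries and failure handling in the state machine means the ETL script itself only has to transform data.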

Data Engineering
DevOps
CI/CD
Canary deployments
Feature flags
Chaos engineering
Configuration management

CI/CD for AWS Glue ETL
AWS CodePipeline
• Help Data Engineers write quality code
• Automate the ETL job release management process
• Mitigate risk

AWS CodePipeline
pipe_line_template.yaml
etl_job.py
live_test.py
AWS CodeCommit

AWS CloudFormation
Amazon S3 (Raw data)
Amazon S3 (Test data)
AWS CodePipeline
AWS CodeCommit
etl_job.py
Role

Amazon S3 (Raw data)
Amazon S3 (Test data)
AWS CodeBuild
AWS CloudFormation
AWS CodeCommit
live_test.py

Amazon Athena
AWS CodeBuild
AWS CloudFormation
AWS CodePipeline
Amazon S3 (Data Lake)
Amazon S3 (Test Data)
SELECT count(*) FROM "sales"."data_lake";
SELECT count(*) FROM "sales_parquet"."test_data";
AWS CodeCommit
✓
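The contents of live_test.py are not shown on the slide. The sketch below is one guess at what such a test could do, assuming the check is that both tables return the same row count. It runs the two queries through the Athena API and compares the results. The output location is hypothetical, and the Athena client is passed in so the logic can be exercised with a stub.

```python
import time

SOURCE_SQL = 'SELECT count(*) FROM "sales"."data_lake"'
TARGET_SQL = 'SELECT count(*) FROM "sales_parquet"."test_data"'


def run_count_query(athena, sql, output_location, poll_seconds=1.0):
    """Run a single-value count query on Athena and return it as an int."""
    execution_id = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=execution_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query ended in state {state}: {sql}")
        time.sleep(poll_seconds)
    rows = athena.get_query_results(
        QueryExecutionId=execution_id)["ResultSet"]["Rows"]
    # rows[0] is the header row; rows[1] holds the count.
    return int(rows[1]["Data"][0]["VarCharValue"])


def counts_match(athena, output_location):
    """True when the ETL output has as many rows as the source table."""
    source = run_count_query(athena, SOURCE_SQL, output_location)
    target = run_count_query(athena, TARGET_SQL, output_location)
    return source == target


def main():
    """Entry point for a build step: exit non-zero if the counts differ."""
    import sys

    import boto3  # needs AWS credentials, e.g. from the build role

    athena = boto3.client("athena")
    # Hypothetical results bucket, for illustration only.
    ok = counts_match(athena, "s3://example-bucket/athena-results/")
    sys.exit(0 if ok else 1)
```

Calling `main()` from an AWS CodeBuild step would fail the pipeline stage whenever the counts differ, which keeps a faulty ETL job from being promoted.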

AWS CodeCommit
AWS CodeBuild
AWS CloudFormation
AWS CloudFormation
AWS CodePipeline
Amazon S3 (Raw data)
Amazon S3 (Prd data)
etl_job.py
Role

Go learn
• Remember the three steps to build a serverless data pipeline
• Use AWS Glue features
• Leverage the breadth of the AWS Platform
• Scan your badge to receive links to learning resources
