SlideShare uma empresa Scribd logo
1 de 51
Baixar para ler offline
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Craig Roach
Solution Architect, Amazon Web Services
Building Serverless ETL Pipelines
With AWS Glue
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Generate Collect Store
Extract
Transform
Load
Analyse
Visualise/
Report
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Collect Store
Extract
Transform
Load
Analyse
Visualise/
Report
Generate
ERP
Connected
devices
Transactions
Social
media
Web logs /
cookies
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Generate Store
Extract
Transform
Load
Analyse
Visualise/
Report
Collect
Polling
Application
Amazon
Kinesis
Stream
Amazon
Kinesis
Firehose
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Generate Store
Extract
Transform
Load
Analyse
Visualise/
Report
Collect
AWSSnowball
Amazon S3
AWSDMS
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Generate Collect
Extract
Transform
Load
Analyse
Visualise/
Report
Store
Amazon
RDS
Amazon S3
Database
on EC2
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Generate Collect Store
Extract
Transform
Load
Visualise/
Report
Analyse
Amazon Redshift
& Redshift
Spectrum
Amazon EMR
Amazon Athena
Amazon Kinesis
Analytics
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Generate Collect Store
Extract
Transform
Load
Analyse
Visualise/
Report
Data
scientists
Business
users
Engagement
platforms
Automation/
events
Data
analysts
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Generate Collect Store Analyse
Visualise/
Report
Extract
Transform
Load
AWS
Lambda
Amazon
Kinesis Enabled
Amazon EMR
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
No Problem…?
Deal with these Terabytes and Petabytes of data
Simplify querying disparate data sets
Combine existing / legacy data with modern data sets
Prepare data for machine learning
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Some extra challenges..
Volumes will grow (the new oil)
Adding data sources
Large proportion of ETL is hand coding
Data formats change over time
• Within data you already have
• Changes will be coming soon
Target schemas change
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Extract Transform
Load
Analyse
Visualise/
Report
Generate
Collect
Store
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
And.. ETL Is Not The Rewarding Part
Time
Value
ETL
Analyse and Consume
Generate
Collect
Store
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Generate Collect Store Analyse
Visualise/
Report
Extract
Transform
Load
AWS Glue
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Why AWS Glue?
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Automate your ETL
Automatically discover and categorise your data
• Connect to your data sources
• Generate your Data Catalogue
Make it immediately searchable and query-able
• Amazon Athena
• Amazon Redshift
• Amazon EMR
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Automate your ETL
Generates your ETL code
• Clean
• Enrich
• Move
Adaptable code
Extension to Spark in Python or Scala
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Automate your ETL
Runs your ETL jobs serverless
• Managed
• Control the amount of resources used
• Scales out automatically
Schedule or trigger jobs
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Why Glue Examples
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
How do I ETL my data?
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Four Steps
Crawl Map Edit and
Explore
Schedule
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
How Do I Discover My Data?
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Glue Data Catalogue: Crawlers
• Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Apache Hive style partitions on Amazon S3
• Built-in classifiers for popular data types
• Custom classifiers using Grok expressions
• Run ad hoc or on a schedule; serverless – only pay when crawler
runs
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Crawlers: Classifiers
IAM Role
Glue Crawler
Data Lakes
Data Warehouse
Databases
Amazon
RDS
Amazon
Redshift
Amazon S3
JDBC Connection
Object Connection
Built-In Classifiers
MySQL
MariaDB
PostreSQL
Aurora
Redshift
Avro
Parquet
ORC
JSON & BJSON
Logs
(Apache, Linux, MS, Ruby, Redis, and many others)
Delimited
(comma, pipe, tab, semicolon)
Compressed Formats
(ZIP, BZIP, GZIP, LZ4, Snappy)
Create additional Custom
Classifiers with Grok!
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Example Classifier
2018-03-18T01:44:19+00:00 [prefix-p-123-a-7z] WARN: There is a message
Grok expression example:
%{TIMESTAMP_ISO8601:timestamp} [%{MESSAGEPREFIX:message_prefix}] %{CRAWLERLOGLEVEL:loglevel} :
%{GREEDYDATA:message}
Built in patterns:
TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
GREEDYDATA .*
Custom patterns
CRAWLERLOGLEVEL (BENCHMARK|ERROR|WARN|INFO|TRACE)
MESSAGEPREFIX .*-.*-.*-.*-.*
Handy Grok debugger:
https://grokdebug.herokuapp.com/
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Crawler: Detecting Partitions
file 1 file N… file 1 file N…
date=10 date=15…
month=Nov
S3 bucket hierarchy Table definition
Estimate schema similarity among files at each level to
handle semi-structured logs, schema evolution…
sim=.99 sim=.95
sim=.93
month
date
col 1
col 2
str
str
int
float
Column Type
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Glue Data Catalog
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Glue Data Catalog: Table Properties
Table
schema
Table properties
Data statistics
Nested fields
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Glue Data Catalog: Version control
List of table versionsCompare schema versions
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
How Do I Build The ETL?
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Job Authoring: Automatic Code Generation
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Job Authoring: Automatic Code Generation
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Job Authoring: ETL Code
 Human-readable, editable, and portable PySpark code
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Job Authoring: Glue Dynamic Frames
Dynamic frame schema
A C D [ ]
X Y
B1 B2
Like Apache Spark’s Data Frames, but better for:
• Cleaning and (re)-structuring semi-structured
data sets, e.g. JSON, Avro, Apache logs ...
No upfront schema needed:
• Infers schema on-the-fly, enabling
transformations in a single pass
Easy to handle the unexpected:
• Tracks new fields, and inconsistent changing
data types with choices, e.g. integer or string
• Automatically mark and separate error records
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Job Authoring: Glue Transforms
B B B
project
B
cast
B
separate into cols
B BResolveChoice()
ApplyMapping()
C
YX
A
A X Y
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Job Authoring: Relationalize() Transform
Semi-structured schema Relational schema
FKA B C.X C.Y
PK Valu
e
Offset
A C D [ ]
X Y
B B
B
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Job Authoring: ETL Code
• Human-readable, editable, and portable PySpark code
• Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
• Customisable: Use native PySpark, import custom libraries, and/or leverage Glue’s
libraries
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Job Authoring: Developer Endpoints
Remote
Interpreter
• Environment to iteratively develop and test ETL code.
• Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint.
• When you are satisfied with the results you can create an ETL job that runs your
code.
Interpreter
Server
Glue Apache Spark environment
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Job Authoring: ETL Code
 Human-readable, editable, and portable
PySpark code
 Flexible: Glue’s ETL library simplifies
manipulating complex, semi-structured
data
 Customisable: Use native PySpark, import
custom libraries, and/or leverage Glue’s
libraries
 Collaborative: share code snippets via
GitHub, reuse code across jobs
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
How do I run ETL jobs?
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Serverless Job Execution
Auto-configure VPC & role-based access
security & isolation preserved
Customers can specify job capacity
using Data Processing Units (DPU)
Automatically scale resources
Only pay for the resources you consume
per-second billing (10-minute min)
Customer VPC Customer VPC
Compute instances
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Data Processing Units (DPUs)
1 DPU = 4 vCPU + 16GB RAM
Storage:
• Free for the first million objects stored
• $1 per 100,000 objects stored above 1M, per month
Requests:
• Free for the first million requests per month
• $1 per million requests above 1M in a month
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Job Composition: Example
Data based
>10 MB new
ad-click
logs
Sales: Revenue by
customer segment
Schedule
Central: ROI by
customer segment
weekly
sales
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Sounds Good In Theory…
What’s It Really Like?
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Demo context
Amazon
RDS
Amazon S3
AWS Glue
Amazon Redshift
& Redshift
Spectrum
Amazon EMR
Amazon Athena
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
How AWS Glue Helps with ETL
Automatically discover your data
Generate ETL code
Run your ETL jobs serverless
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
Speaker Contact
Solutions Architect
craigar@amazon.com
Craig Roach
Thank You

Mais conteúdo relacionado

Mais procurados

The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineAmazon Web Services
 
Introduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxIntroduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxSwathiPonugumati
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSAmazon Web Services
 
Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...
Idera live 2021:  Keynote Presentation The Future of Data is The Data Cloud b...Idera live 2021:  Keynote Presentation The Future of Data is The Data Cloud b...
Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...IDERA Software
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveCobus Bernard
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarAmazon Web Services
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureDatabricks
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSAmazon Web Services
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks
 
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdfData & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdfChris Bingham
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Customer-Centric Data Management for Better Customer Experiences
Customer-Centric Data Management for Better Customer ExperiencesCustomer-Centric Data Management for Better Customer Experiences
Customer-Centric Data Management for Better Customer ExperiencesInformatica
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...HostedbyConfluent
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglyTyler Wishnoff
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 

Mais procurados (20)

The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain PipelineThe Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
The Zen of DataOps – AWS Lake Formation and the Data Supply Chain Pipeline
 
Introduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptxIntroduction to AWS Lake Formation.pptx
Introduction to AWS Lake Formation.pptx
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWS
 
Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...
Idera live 2021:  Keynote Presentation The Future of Data is The Data Cloud b...Idera live 2021:  Keynote Presentation The Future of Data is The Data Cloud b...
Idera live 2021: Keynote Presentation The Future of Data is The Data Cloud b...
 
AWS Lake Formation Deep Dive
AWS Lake Formation Deep DiveAWS Lake Formation Deep Dive
AWS Lake Formation Deep Dive
 
Building a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - WebinarBuilding a Modern Data Architecture on AWS - Webinar
Building a Modern Data Architecture on AWS - Webinar
 
Architect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh ArchitectureArchitect’s Open-Source Guide for a Data Mesh Architecture
Architect’s Open-Source Guide for a Data Mesh Architecture
 
Architecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWSArchitecting a Serverless Data Lake on AWS
Architecting a Serverless Data Lake on AWS
 
Databricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI InitiativesDatabricks + Snowflake: Catalyzing Data and AI Initiatives
Databricks + Snowflake: Catalyzing Data and AI Initiatives
 
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdfData & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
Data & Analytics ReInvent Recap [AWS Basel Meetup - Jan 2023].pdf
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Customer-Centric Data Management for Better Customer Experiences
Customer-Centric Data Management for Better Customer ExperiencesCustomer-Centric Data Management for Better Customer Experiences
Customer-Centric Data Management for Better Customer Experiences
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
Standing on the Shoulders of Open-Source Giants: The Serverless Realtime Lake...
 
Snowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the UglySnowflake: The Good, the Bad, and the Ugly
Snowflake: The Good, the Bad, and the Ugly
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Lakehouse in Azure
Lakehouse in AzureLakehouse in Azure
Lakehouse in Azure
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Data Mesh
Data MeshData Mesh
Data Mesh
 

Semelhante a Building Serverless ETL Pipelines with AWS Glue

Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Amazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018Amazon Web Services
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL PipelinesAmazon Web Services
 
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Amazon Web Services
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...Amazon Web Services LATAM
 
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfBuilding_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfAmazon Web Services
 
Workshop: Architecting a Serverless Data Lake
Workshop: Architecting a Serverless Data LakeWorkshop: Architecting a Serverless Data Lake
Workshop: Architecting a Serverless Data LakeAmazon Web Services
 
Building a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudBuilding a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudAmazon Web Services
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Amazon Web Services
 
It's all about the data - Tel Aviv Summit 2018
It's all about the data - Tel Aviv Summit 2018It's all about the data - Tel Aviv Summit 2018
It's all about the data - Tel Aviv Summit 2018Amazon Web Services
 
Using Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczUsing Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczAmazon Web Services
 

Semelhante a Building Serverless ETL Pipelines with AWS Glue (20)

Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
Building Serverless ETL Pipelines with AWS Glue - AWS Summit Sydney 2018
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018
 
Building Serverless ETL Pipelines
Building Serverless ETL PipelinesBuilding Serverless ETL Pipelines
Building Serverless ETL Pipelines
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
Serverless Data Prep with AWS Glue (ANT313) - AWS re:Invent 2018
 
Log Analytics with AWS
Log Analytics with AWSLog Analytics with AWS
Log Analytics with AWS
 
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...Immersion Day -  Como gerenciar seu catálogo de dados e processo de transform...
Immersion Day - Como gerenciar seu catálogo de dados e processo de transform...
 
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdfBuilding_a_Modern_Data_Platform_in_the_Cloud.pdf
Building_a_Modern_Data_Platform_in_the_Cloud.pdf
 
Workshop: Architecting a Serverless Data Lake
Workshop: Architecting a Serverless Data LakeWorkshop: Architecting a Serverless Data Lake
Workshop: Architecting a Serverless Data Lake
 
Building a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudBuilding a Modern Data Platform in the Cloud
Building a Modern Data Platform in the Cloud
 
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon RedshiftBuilding a Modern Data Warehouse - Deep Dive on Amazon Redshift
Building a Modern Data Warehouse - Deep Dive on Amazon Redshift
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
Data Warehouses and Data Lakes
Data Warehouses and Data LakesData Warehouses and Data Lakes
Data Warehouses and Data Lakes
 
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018
 
It's all about the data - Tel Aviv Summit 2018
It's all about the data - Tel Aviv Summit 2018It's all about the data - Tel Aviv Summit 2018
It's all about the data - Tel Aviv Summit 2018
 
Using Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczUsing Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter Dachnowicz
 

Mais de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Building Serverless ETL Pipelines with AWS Glue

  • 1. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Craig Roach Solution Architect, Amazon Web Services Building Serverless ETL Pipelines With AWS Glue
  • 2. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Generate Collect Store Extract Transform Load Analyse Visualise/ Report
  • 3. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Collect Store Extract Transform Load Analyse Visualise/ Report Generate ERP Connected devices Transactions Social media Web logs / cookies
  • 4. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Generate Store Extract Transform Load Analyse Visualise/ Report Collect Polling Application Amazon Kinesis Stream Amazon Kinesis Firehose
  • 5. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Generate Store Extract Transform Load Analyse Visualise/ Report Collect AWSSnowball Amazon S3 AWSDMS
  • 6. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Generate Collect Extract Transform Load Analyse Visualise/ Report Store Amazon RDS Amazon S3 Database on EC2
  • 7. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Generate Collect Store Extract Transform Load Visualise/ Report Analyse Amazon Redshift & Redshift Spectrum Amazon EMR Amazon Athena Amazon Kinesis Analytics
  • 8. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Generate Collect Store Extract Transform Load Analyse Visualise/ Report Data scientists Business users Engagement platforms Automation/ events Data analysts
  • 9. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Generate Collect Store Analyse Visualise/ Report Extract Transform Load AWS Lambda Amazon Kinesis Enabled Amazon EMR
  • 10. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. No Problem…? Deal with these Terabytes and Petabytes of data Simplify querying disparate data sets Combine existing / legacy data with modern data sets Prepare data for machine learning
  • 11. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
  • 12. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
  • 13. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
  • 14. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Some extra challenges.. Volumes will grow (the new oil) Adding data sources Large proportion of ETL is hand coding Data formats change over time • Within data you already have • Changes will be coming soon Target schemas change
  • 15. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Extract Transform Load Analyse Visualise/ Report Generate Collect Store
  • 16. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. And.. ETL Is Not The Rewarding Part Time Value ETL Analyse and Consume Generate Collect Store
  • 17. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Generate Collect Store Analyse Visualise/ Report Extract Transform Load AWS Glue
  • 18. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Why AWS Glue?
  • 19. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Automate your ETL Automatically discover and categorise your data • Connect to your data sources • Generate your Data Catalogue Make it immediately searchable and query-able • Amazon Athena • Amazon Redshift • Amazon EMR
  • 20. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Automate your ETL Generates your ETL code • Clean • Enrich • Move Adaptable code Extension to Spark in Python or Scala
  • 21. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Automate your ETL Runs your ETL jobs serverless • Managed • Control the amount of resources used • Scales out automatically Schedule or trigger jobs
  • 22. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Why Glue Examples
  • 23. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. How do I ETL my data?
  • 24. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Four Steps Crawl Map Edit and Explore Schedule
  • 25. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. How Do I Discover My Data?
  • 26. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Glue Data Catalogue: Crawlers • Automatically discover new data and extract schema definitions • Detect schema changes and version tables • Detect Apache Hive style partitions on Amazon S3 • Built-in classifiers for popular data types • Custom classifiers using Grok expressions • Run ad hoc or on a schedule; serverless – only pay when crawler runs
  • 27. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Crawlers: Classifiers IAM Role Glue Crawler Data Lakes Data Warehouse Databases Amazon RDS Amazon Redshift Amazon S3 JDBC Connection Object Connection Built-In Classifiers MySQL MariaDB PostreSQL Aurora Redshift Avro Parquet ORC JSON & BJSON Logs (Apache, Linux, MS, Ruby, Redis, and many others) Delimited (comma, pipe, tab, semicolon) Compressed Formats (ZIP, BZIP, GZIP, LZ4, Snappy) Create additional Custom Classifiers with Grok!
  • 28. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Example Classifier 2018-03-18T01:44:19+00:00 [prefix-p-123-a-7z] WARN: There is a message Grok expression example: %{TIMESTAMP_ISO8601:timestamp} [%{MESSAGEPREFIX:message_prefix}] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message} Built in patterns: TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}? GREEDYDATA .* Custom patterns CRAWLERLOGLEVEL (BENCHMARK|ERROR|WARN|INFO|TRACE) MESSAGEPREFIX .*-.*-.*-.*-.* Handy Grok debugger: https://grokdebug.herokuapp.com/
  • 29. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Crawler: Detecting Partitions file 1 file N… file 1 file N… date=10 date=15… month=Nov S3 bucket hierarchy Table definition Estimate schema similarity among files at each level to handle semi-structured logs, schema evolution… sim=.99 sim=.95 sim=.93 month date col 1 col 2 str str int float Column Type
  • 30. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Glue Data Catalog
  • 31. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Glue Data Catalog: Table Properties Table schema Table properties Data statistics Nested fields
  • 32. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Glue Data Catalog: Version control List of table versionsCompare schema versions
  • 33. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. How Do I Build The ETL?
  • 34. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Job Authoring: Automatic Code Generation
  • 35. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Job Authoring: Automatic Code Generation
  • 36. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Job Authoring: ETL Code  Human-readable, editable, and portable PySpark code
  • 37. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Job Authoring: Glue Dynamic Frames Dynamic frame schema A C D [ ] X Y B1 B2 Like Apache Spark’s Data Frames, but better for: • Cleaning and (re)-structuring semi-structured data sets, e.g. JSON, Avro, Apache logs ... No upfront schema needed: • Infers schema on-the-fly, enabling transformations in a single pass Easy to handle the unexpected: • Tracks new fields, and inconsistent changing data types with choices, e.g. integer or string • Automatically mark and separate error records
  • 38. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Job Authoring: Glue Transforms B B B project B cast B separate into cols B BResolveChoice() ApplyMapping() C YX A A X Y
  • 39. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Job Authoring: Relationalize() Transform Semi-structured schema Relational schema FKA B C.X C.Y PK Valu e Offset A C D [ ] X Y B B B
  • 40. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Job Authoring: ETL Code • Human-readable, editable, and portable PySpark code • Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data • Customisable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries
  • 41. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Job Authoring: Developer Endpoints Remote Interpreter • Environment to iteratively develop and test ETL code. • Connect your IDE or notebook (e.g. Zeppelin) to a Glue development endpoint. • When you are satisfied with the results you can create an ETL job that runs your code. Interpreter Server Glue Apache Spark environment
  • 42. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Job Authoring: ETL Code  Human-readable, editable, and portable PySpark code  Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data  Customisable: Use native PySpark, import custom libraries, and/or leverage Glue’s libraries  Collaborative: share code snippets via GitHub, reuse code across jobs
  • 43. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. How do I run ETL jobs?
  • 44. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Serverless Job Execution Auto-configure VPC & role-based access security & isolation preserved Customers can specify job capacity using Data Processing Units (DPU) Automatically scale resources Only pay for the resources you consume per-second billing (10-minute min) Customer VPC Customer VPC Compute instances
  • 45. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Data Processing Units (DPUs) 1 DPU = 4 vCPU + 16GB RAM Storage: • Free for the first million objects stored • $1 per 100,000 objects stored above 1M, per month Requests: • Free for the first million requests per month • $1 per million requests above 1M in a month
  • 46. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Job Composition: Example Data based >10 MB new ad-click logs Sales: Revenue by customer segment Schedule Central: ROI by customer segment weekly sales
  • 47. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Sounds Good In Theory… What’s It Really Like?
  • 48. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Demo context Amazon RDS Amazon S3 AWS Glue Amazon Redshift & Redshift Spectrum Amazon EMR Amazon Athena
  • 49. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. How AWS Glue Helps with ETL Automatically discover your data Generate ETL code Run your ETL jobs serverless
  • 50. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. Speaker Contact Solutions Architect craigar@amazon.com Craig Roach