Immersion Day - How to Manage Your Data Catalog and Transformation Process

© 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Daniel Bento, Solutions Architect
Working Within the Data Lake

Working Within the Data Lake
1. Optimizing for Cost and Performance
2. Cataloging Data Schemas with AWS Glue
3. Transforming Data with AWS Glue
4. Querying the Data Lake with Amazon Athena
5. Visualizing the Data Lake with Amazon QuickSight

AWS Data Lake and Big Data Services
• Central Storage (scalable, secure, cost-effective): Amazon S3
• Catalog & Search: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service
• Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
• Manage & Secure: AWS KMS, AWS CloudTrail, AWS IAM, Amazon CloudWatch
• Data Ingestion: AWS Snowball, AWS Storage Gateway, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service
• Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS

Optimizing for Cost and Performance
• Managed Services: pay for what you use, not for what you run
• Partitioning: pay for the data your query needs, not to scan all of your data
• Compression: pay for what you store, not for what you process

Partitioning

select * from datalake where
dt='20170515' and country='US'

Non-partitioned layout:
datalake
├── 20170515T1423-GB-01.tar.gz
├── 20170515T1423-GB-02.tar.gz
├── 20170515T1500-US-01.tar.gz
├── 20170516T1500-US-01.tar.gz
├── 20170516T1600-GB-01.tar.gz
└── 20170516T1600-GB-02.tar.gz

Partitioned layout:
datalake
├── dt=20170515
│   ├── country=GB
│   │   ├── 20170515T1423-GB-01.tar.gz
│   │   └── 20170515T1423-GB-02.tar.gz
│   └── country=US
│       └── 20170515T1500-US-01.tar.gz
└── dt=20170516
    ├── country=GB
    │   ├── 20170516T1600-GB-01.tar.gz
    │   └── 20170516T1600-GB-02.tar.gz
    └── country=US
        └── 20170516T1500-US-01.tar.gz
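
A minimal sketch of how the flat file names in this slide could be mapped to the dt=/country= layout. The file-name pattern is an assumption read off the listing (date, time, country code, sequence number), and partitioned_key is a hypothetical helper, not part of any AWS library.

```python
import re

# Assumed file-name pattern, inferred from the listing:
# <yyyymmdd>T<hhmm>-<country>-<sequence>.tar.gz
FILE_NAME = re.compile(
    r"^(?P<dt>\d{8})T\d{4}-(?P<country>[A-Z]{2})-\d+\.tar\.gz$"
)


def partitioned_key(file_name: str) -> str:
    """Return the key of the file under a dt=/country= partitioned prefix."""
    match = FILE_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return f"dt={match['dt']}/country={match['country']}/{file_name}"


if __name__ == "__main__":
    for name in ["20170515T1423-GB-01.tar.gz", "20170516T1500-US-01.tar.gz"]:
        print(partitioned_key(name))
```

With keys laid out this way, a filter on dt and country only needs to touch the objects under the matching prefixes.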

Partitioning - Advantages

Query 1: select count(*) from datalake where dt='20170515'
Query 2: select count(*) from datalake where dt >= '20170515' and dt < '20170516'

               Query 1                          Query 2
               Non-Partitioned  Partitioned     Non-Partitioned  Partitioned
Run Time       9.71 sec         2.16 sec        10.41 sec        2.73 sec
Data Scanned   74.1 GB          29.06 MB        74.1 GB          871.39 MB
Cost           $0.36            $0.0001         $0.36            $0.004
Results        77% faster, 99% cheaper          73% faster, 98% cheaper
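
The "faster" and "cheaper" percentages on this slide can be re-derived from the run times and costs it reports. The sketch below does that arithmetic; percent_reduction is a hypothetical helper, and the slide's figures appear to be truncated rather than rounded.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage by which `after` is lower than `before`."""
    return (before - after) / before * 100.0


# Run times (seconds) and costs (USD) as reported on the slide.
runtime_q1 = percent_reduction(9.71, 2.16)
runtime_q2 = percent_reduction(10.41, 2.73)
cost_q1 = percent_reduction(0.36, 0.0001)
cost_q2 = percent_reduction(0.36, 0.004)

if __name__ == "__main__":
    print(f"Query 1: {runtime_q1:.2f}% faster, {cost_q1:.2f}% cheaper")
    print(f"Query 2: {runtime_q2:.2f}% faster, {cost_q2:.2f}% cheaper")
```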

Compression

• Compressing your data can speed up your queries significantly
• Splittable formats enable parallel processing across nodes

Algorithm        Splittable    Compression Ratio  Algorithm Speed  Good For
Gzip (DEFLATE)   No            High               Medium           Raw Storage
bzip2            Yes           Very High          Slow             Very Large Files
LZO              Yes           Low                Fast             Slow Analytics
Snappy           Yes and No *  Low                Very Fast        Slow & Fast Analytics

* Depends on whether the source format is splittable and can output each record into a Snappy block
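
A small sketch comparing two of the algorithms from the table using only the Python standard library (gzip and bz2). LZO and Snappy are not in the standard library, so they are left out. The sample data is made up and highly repetitive, so the ratios it prints say little about real datasets.

```python
import bz2
import gzip

# Hypothetical sample data: one CSV-like row repeated many times.
sample = b"2017-05-15,US,checkout,19.99\n" * 10_000

gzip_bytes = gzip.compress(sample)
bz2_bytes = bz2.compress(sample)


def ratio(original: bytes, compressed: bytes) -> float:
    """Compression ratio: original size divided by compressed size."""
    return len(original) / len(compressed)


if __name__ == "__main__":
    print(f"original: {len(sample)} bytes")
    print(f"gzip:     {len(gzip_bytes)} bytes (ratio {ratio(sample, gzip_bytes):.1f})")
    print(f"bzip2:    {len(bz2_bytes)} bytes (ratio {ratio(sample, bz2_bytes):.1f})")
```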

Compression - Example
Snappy Compression with Parquet File Format

Format   Size on S3  Run Time    Data Scanned  Cost
Text     1.15 TB     3m 56s      1.15 TB       $5.75
Parquet  130 GB      6.78s       2.51 GB       $0.013
Result   87% less    34x faster  99% less      99.7% savings

Compression – File Counts
Fewer, larger files are better than many, smaller files (when splittable)
• Faster Listing Operations
• Fewer Requests to Amazon S3
• Less Metadata to Manage
• Faster Query Performance
Query                          # files     Run Time
select count(*) from datalake  5000 files  8.4 sec
select count(*) from datalake  1 file      2.31 sec
Result                                     72% Faster
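
One way to move from many small files toward fewer, larger ones is to plan a compaction pass that groups files up to a target size. The sketch below is only an illustration of that idea: plan_compaction is a hypothetical helper, and the 128 MB target is an assumed value, not a recommendation from this deck.

```python
from typing import List


def plan_compaction(file_sizes: List[int], target_bytes: int) -> List[List[int]]:
    """Greedily group consecutive files so each group stays within target_bytes.

    A single file larger than target_bytes ends up in a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in file_sizes:
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    MB = 1024 * 1024
    small_files = [1 * MB] * 5000           # 5000 files of 1 MB each (made up)
    plan = plan_compaction(small_files, 128 * MB)
    print(f"{len(small_files)} input files -> {len(plan)} output files")
```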

Cataloging and Transforming Data
Working with AWS Glue

Transforming Data
Over 90% of ETL jobs in the cloud are hand-coded
Which is good …
• Flexible
• Powerful
• Unit Tests
• CI/CD
• Developer Tools …
… but also bad!
• Brittle
• Error-Prone
• Laborious
• Sources Change
• Schemas Change
• Volume Changes
• EVERYTHING KEEPS CHANGING !!!

Discover: Automatically discover and categorize your data, making it immediately
searchable and queryable across data sources
Develop: Generate code to clean, enrich, and reliably move data between various
data sources
Deploy: Run your jobs on a serverless, fully managed, scale-out environment.
No compute resources to provision or manage.

AWS Glue: Components
Data Catalog
• Hive Metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrated with Amazon Athena, Amazon Redshift Spectrum, Amazon EMR
Job Execution
• Run jobs on a serverless Spark platform
• Provides flexible scheduling
• Handles dependency resolution, monitoring and alerting
Job Authoring
• Auto-generates ETL code
• Built on open frameworks – Python/Scala and Apache Spark
• Developer-centric – editing, debugging, sharing

Glue: Data Catalog
Features Include
• Search metadata for data discovery
• Connections to S3, RDS, Redshift, JDBC, and DynamoDB
• Classification for identifying and parsing files
• Versioning of table metadata as schemas change

Glue: Data Catalog – Queryable by Many Services
The Glue Data Catalog can be queried by:
• Glue ETL
• Amazon Athena
• Redshift Spectrum
• EMR (Hadoop/Spark)

Glue: Data Catalog - Crawlers
Features Include:
• Built-in classifiers
• Detect file type
• Extract schema
• Identify partitions
• On-Demand or Scheduled Execution
• Build-your-own classifiers
• Grok for ease of use
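
Grok patterns are, in essence, named regular expressions. The sketch below uses Python's re module with named groups to show the kind of field extraction a custom classifier pattern expresses. It does not use Glue's Grok syntax, and the log format and field names are hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical log format: "<ISO timestamp> <LEVEL> <free-text message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)$"
)


def classify(line: str) -> Optional[Dict[str, str]]:
    """Return the extracted fields, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(classify("2017-05-15T14:23:00 ERROR disk full"))
    print(classify("not a log line"))
```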

Crawlers: Classifiers
Glue Crawler (with an IAM Role) connects to:
• Data Lakes: Amazon S3 (Object Connection)
• Data Warehouse: Amazon Redshift (JDBC Connection)
• Databases: Amazon RDS (JDBC Connection)
Built-In Classifiers
• MySQL, MariaDB, PostgreSQL, Aurora, Redshift
• Avro, Parquet, ORC, JSON & BSON
• Logs (Apache, Linux, MS, Ruby, Redis, and many others)
• Delimited (comma, pipe, tab, semicolon)
• Compressed Formats (ZIP, BZIP, GZIP, LZ4, Snappy)
Create additional Custom Classifiers with Grok!

AWS Glue Data Catalog
Bring in metadata from a variety of data sources (Amazon S3, Amazon Redshift, etc.) into a single
categorized list that is searchable

Data Catalog: Table details
Table schema
Table properties
Data statistics
Nested fields

Data Catalog: Version control
• List of table versions
• Compare schema versions

Glue: Crawlers – Detecting Schema and Partitions
Estimate schema similarity among files at each level to
handle semi-structured logs, schema evolution…
[Figure: S3 bucket hierarchy (month=Nov; date=10 … date=15; file 1 … file N under each date) annotated with similarity scores sim=.99, sim=.95 and sim=.93, mapped to a table definition]
Table definition:
Column  Type
month   str
date    str
col 1   int
col 2   float

Data Catalog: Automatic partition detection
Automatically register available partitions
Table partitions
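
A minimal sketch of the idea behind partition detection for key=value style prefixes: split each object key into path segments and collect the name=value pairs. This illustrates the concept only; it is not how the Glue crawler is implemented, and both helper names are hypothetical.

```python
from typing import Dict, List


def partition_values(key: str) -> Dict[str, str]:
    """Extract name=value partition segments from an object key."""
    values: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:      # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            values[name] = value
    return values


def discover_partitions(keys: List[str]) -> List[Dict[str, str]]:
    """Return the distinct partitions found across keys, in first-seen order."""
    seen: List[Dict[str, str]] = []
    for key in keys:
        values = partition_values(key)
        if values and values not in seen:
            seen.append(values)
    return seen


if __name__ == "__main__":
    keys = [
        "datalake/dt=20170515/country=GB/20170515T1423-GB-01.tar.gz",
        "datalake/dt=20170515/country=GB/20170515T1423-GB-02.tar.gz",
        "datalake/dt=20170515/country=US/20170515T1500-US-01.tar.gz",
    ]
    for partition in discover_partitions(keys):
        print(partition)
```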

AWS Glue: Job Authoring - Pick a Source
Choose from Manually-Defined or Crawler-Generated Sources:
• S3 Bucket
• RDBMS
• Redshift
• DynamoDB

AWS Glue: Job Authoring – Pick a Target
Write output results into an existing table.
Crawler-Defined or Manually-Created:
• S3 Bucket
• RDBMS
• Redshift

AWS Glue: Job Authoring – Code Generation
• Existing columns in target
• Can extend/add new columns to target

1. Customize the mappings
2. Glue generates transformation graph and either Python or Scala code
3. Specify trigger condition
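
To make "customize the mappings" concrete, the sketch below applies a list of column mappings to a single record in plain Python. It does not use the Glue ETL library; apply_mappings, the cast table and the column names are all hypothetical, with each mapping written as (source column, source type, target column, target type).

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical cast table: target type name -> Python conversion.
CASTS: Dict[str, Callable[[Any], Any]] = {
    "string": str,
    "int": int,
    "double": float,
}

# (source column, source type, target column, target type)
Mapping = Tuple[str, str, str, str]


def apply_mappings(record: Dict[str, Any], mappings: List[Mapping]) -> Dict[str, Any]:
    """Rename and cast the mapped columns; unmapped columns are dropped."""
    output: Dict[str, Any] = {}
    for source, _source_type, target, target_type in mappings:
        if source in record:
            output[target] = CASTS[target_type](record[source])
    return output


if __name__ == "__main__":
    mappings = [
        ("id", "string", "customer_id", "int"),
        ("amt", "string", "amount", "double"),
    ]
    print(apply_mappings({"id": "7", "amt": "19.99", "extra": "x"}, mappings))
```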

Job authoring: ETL code
• Human-readable, editable, and portable PySpark or Scala code
• Flexible: Glue’s ETL library simplifies manipulating complex, semi-structured data
• Customizable: Use native PySpark / Scala, import custom libraries, and/or leverage Glue’s libraries
• Collaborative: share code snippets via GitHub, reuse code across jobs

AWS Glue Orchestration
Compose jobs globally with event-based dependencies
• Easy to reuse and leverage work across organization boundaries
Multiple triggering mechanisms
• Schedule-based: e.g., time of day
• Event-based: e.g., job completion
• On-demand: e.g., AWS Lambda
Logs and alerts are available in Amazon CloudWatch
[Figure: example orchestration across teams. Jobs: "Marketing: Ad-spend by customer segment", "Sales: Revenue by customer segment", "Central: ROI by customer segment". Trigger labels: Event based, Lambda trigger, Schedule, Data based, Weekly sales.]
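
Event-based dependencies amount to running each job only after the jobs it depends on have completed. The sketch below orders three jobs with Python's standard graphlib module (Python 3.9+). The job names echo the slide, but the dependency edges are an assumption made for illustration, and nothing here calls AWS Glue.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Assumed dependencies: each job maps to the set of jobs it waits for.
dependencies: Dict[str, Set[str]] = {
    "marketing_ad_spend": set(),
    "sales_revenue": set(),
    "central_roi": {"marketing_ad_spend", "sales_revenue"},
}


def run_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one valid execution order in which every job follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    for job in run_order(dependencies):
        print(f"run {job}")
```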

AWS Glue: Job Execution – Progress and History
• Track ETL job progress and inspect logs directly from the console.
• Logs are written to CloudWatch for simple access.
• Errors are automatically extracted and presented in the Error Logs for
easy troubleshooting of jobs.

AWS Glue: Overall Flow
1. Crawl your raw data
2. Create your desired targets
3. Generate and prep your ETL
4. Execute your job

Querying the Data Lake
Working with Amazon Athena

An interactive query service that makes it easy to analyze
data directly from Amazon S3 using Standard SQL

What does it look like?

Athena is Serverless
• No Infrastructure or
administration
• Zero Spin up time
• Transparent upgrades

Use ANSI SQL
• Support for complex joins,
nested queries & window
functions
• Support for complex data types
(arrays, structs)
• Support for partitioning of data
by any key

Familiar Technologies Under the Covers
Presto: used for SQL queries
• In-memory distributed query engine
• ANSI-SQL compatible with extensions
Apache Hive: used for DDL functionality
• Complex data types
• Multitude of formats
• Supports data partitioning

Amazon Athena is Cost Effective
• Pay per query
• $5 per TB scanned from S3
• DDL Queries and failed queries are free
• Save by using compression, columnar formats, partitions
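
A rough sketch of the pay-per-query arithmetic, using the $5 per TB scanned price from this slide. It assumes binary units (1 TB = 1024^4 bytes) and ignores any rounding or per-query minimums, so treat the results as estimates. athena_cost_usd is a hypothetical helper.

```python
PRICE_PER_TB_USD = 5.0          # price quoted on the slide
BYTES_PER_GB = 1024 ** 3        # assumption: binary units
BYTES_PER_TB = 1024 ** 4


def athena_cost_usd(bytes_scanned: float) -> float:
    """Estimated query cost for the given number of bytes scanned."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    # Data-scanned figures taken from the earlier partitioning and compression slides.
    for label, scanned in [
        ("non-partitioned, 74.1 GB", 74.1 * BYTES_PER_GB),
        ("text format, 1.15 TB", 1.15 * BYTES_PER_TB),
        ("Parquet, 2.51 GB", 2.51 * BYTES_PER_GB),
    ]:
        print(f"{label}: ${athena_cost_usd(scanned):.4f}")
```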

Serverless data lake & analytics with AWS
1. Crawlers scan your data sets and populate the Glue Data Catalog
2. The Glue Data Catalog serves as a central metadata repository
3. Once catalogued in Glue, your data is immediately available for analytics
[Figure: Amazon S3, AWS Glue Crawler, AWS Glue Data Catalog, Amazon Athena, Amazon EMR, Amazon Redshift Spectrum and QuickSight, annotated with steps 1 to 3]
https://github.com/aws-samples/serverless-data-analytics

Visualizing the Data Lake
Working with Amazon QuickSight

QuickSight Overview
Scalable
QuickSight automatically scales with your usage and
activity, with no need for additional infrastructure.
From 10 users to 10,000, QuickSight seamlessly grows
with you.
No Servers to Manage
QuickSight is a fully managed cloud application,
meaning there's no upfront cost, software to deploy,
capacity planning, maintenance, upgrades, or
migrations
Fully integrated
QuickSight integrates with your data sources and
other AWS services like Redshift, S3, Athena, Aurora,
RDS, IAM, CloudTrail, Cloud Directory and more –
providing you with everything you need to build an
end-to-end BI solution.
Pay For What You Use
Provide read-only access to interactive dashboards and
pay only when your users access them with pay-per-
session pricing. There are no upfront costs, no annual
commitments, and no charges for inactive users.

Connect to your data, wherever it is
QuickSight is natively integrated with AWS data sources, as well as on-premises and hosted
databases and third-party business applications
On-premises: securely connect to on-premises databases and flat files like Excel and CSV
• Excel
• CSV
• Teradata
• MySQL
• SQL Server
• PostgreSQL
In the cloud: connect to hosted databases, big data formats, and secure VPCs
• Redshift
• RDS
• S3
• Athena
• Aurora
• Teradata
• MySQL
• Presto
• Spark
• SQL Server
• PostgreSQL
• MariaDB
• Snowflake
• IoT Analytics
Applications: connect directly to third-party business applications
• Salesforce
• Square
• Adobe Analytics
• Jira
• ServiceNow
• Twitter
• GitHub

Data Prep
• Preview data
• Rename, remove fields, change data types
• Create new calculated fields
• Filter rows
• Issue direct query or ingest to SPICE
• Visually join tables from the same relational DB
• Push down custom SQL queries

SPICE
QuickSight is powered by SPICE, a super-fast calculation engine that delivers
performance and scale, regardless of how many users are active.
[Figure: Your Data Source and SPICE]

User Types / User Roles
Admin (QS Admin)
• Manage Users
• Manage SPICE Capacity
• Manage VPC Connections
• Manage Account Settings
• Sometimes separate from Business Users, sometimes the same
• Usually has AWS Console access
Author (Analyst)
• Create Data Sets
• Create Analyses
• Create Dashboards
• Sometimes in IT, sometimes Business Users
• 'Data Analyst', 'Data Engineer', 'BI Engineer'
Reader (Business User)
• Consume Dashboards
• Anyone
• Can be internal or external users (customers/partners/3rd parties)

Data governance
Create managed datasets that give power users and authors the flexibility to
perform self-serve analytics on data that you control.
Create datasets that:
• Can be shared with any user
• Automatically refresh
• Have row level security
• Users cannot modify
• Dynamically update
with changes

Sample Embedded Dashboards

Anomaly Detection
