AWS Data Lake: data analysis @ scale

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Giorgio Nobile, Solutions Architect
April 24, 2018
Data Lake: data analysis @ scale

Data Lake Challenges
Data variety and data volumes are increasing rapidly
Multiple Consumers and Applications
Ingest, Discover, Catalog, Understand, and Curate all kinds of data
Quickly drive new insights

Purpose-built engines.
Right tool for the right job.
Customer Needs Come First

Amazon’s Purpose-Built Analytics Offerings
Collect, Store, Analyze
Amazon Kinesis Firehose
AWS Direct Connect
Amazon Snowball
Amazon Kinesis Analytics
Amazon Kinesis Streams
Amazon S3
Amazon Glacier
Amazon CloudSearch
Amazon RDS, Amazon Aurora
Amazon DynamoDB
Amazon Elasticsearch
Amazon EMR
Amazon Redshift
Amazon QuickSight
AWS Database Migration Service
AWS Glue
Amazon Athena
Amazon AI

Traditional Data Warehouse
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Relational data
Terabytes to Petabytes scale
Schema defined prior to data load
Operational reporting and ad hoc analysis

Data Lakes extend traditional warehouses
Relational and non-relational data
Terabytes to Exabytes scale
Schema defined during analysis (Schema on Read)
Diverse analytical engines to gain insights
Designed for low cost storage and analytics
OLTP, ERP, CRM, LOB
Data Warehouse
Business Intelligence
Data Lake
Devices, Web, Sensors, Social
Data Catalog
Machine Learning
DW Queries
Big data processing
Interactive
Real-time

Data Lakes on AWS
Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams
Amazon S3, AWS Glue
Wide variety of ways to bring data in
Durability and availability at Exabyte scale
Security, compliance, and audit capabilities
Run any analytics on the same data without movement
Scale storage and compute independently
Store at $0.023 / GB-month
Query for $0.05 / GB scanned
Redshift, EMR, Athena, Kinesis, Elasticsearch Service

Data Lake Architecture
Catalog & Search
Access and search metadata
Access & User Interface
Give your users easy and secure access
DynamoDB, Elasticsearch, API Gateway, Identity & Access Management, Cognito
QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS
Central Storage
Secure, cost-effective storage in Amazon S3
S3
Snowball, Database Migration Service, Kinesis Firehose, Direct Connect
Data Ingestion
Get your data into S3 quickly and securely
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified
Processing & Analytics
Use of predictive and prescriptive analytics to gain better understanding
Security Token Service, CloudWatch, CloudTrail, Key Management Service
Glue ETL

Serverless Analytics
Deliver cost-effective analytic solutions faster
S3 Data Lake
Glue (Data Catalog and ETL)
Redshift Spectrum
QuickSight
Serverless
Zero infrastructure, zero administration
Pay only for what you use, not for idle resources
Availability and fault tolerance built in
Automatically scales resources with usage
Snowball, Snowmobile, Kinesis Data Firehose, many other sources
Other BI Tools
Amazon Athena
Amazon EMR

Storing is not enough, data needs to be discoverable
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner
CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data
Semi-structured, Unstructured

AWS Glue—Serverless Data Catalog & ETL
Data Catalog: Discover data and extract schema
ETL: Auto-generate customizable code in Python and Spark
Automatically discovers data and stores schema
Data is immediately searchable and available for ETL
Generates customizable code
Schedules and runs your ETL jobs
Serverless

Crawlers: Automatic Schema Inference
enumerate S3 objects (file 1, file 2, … file N)
identify file type and parse files
semi-structured per-file schema
semi-structured unified schema
custom classifiers: Grok based parser
built-in classifiers: JSON parser, CSV parser, Parquet parser, …
Diagram: per-file schema trees built from int, char, bool, array, and struct fields are merged into a single unified schema
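The crawler flow on this slide (parse each file into its own schema, then merge the per-file schemas into one unified schema) can be sketched in plain Python. This is an illustration only: the `infer_schema` and `unify` helpers, the sample records, and the rule of widening conflicting types to `string` are assumptions made for the sketch, not the logic AWS Glue actually applies.

```python
import json
from typing import Any, Dict, List

Schema = Dict[str, str]


def type_name(value: Any) -> str:
    """Map a parsed JSON value to a coarse type label."""
    if isinstance(value, bool):  # bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "string"


def infer_schema(record: Dict[str, Any]) -> Schema:
    """Per-file schema: top-level field name -> type label."""
    return {name: type_name(value) for name, value in record.items()}


def unify(schemas: List[Schema]) -> Schema:
    """Merge per-file schemas; conflicting types widen to string (assumed rule)."""
    unified: Schema = {}
    for schema in schemas:
        for name, kind in schema.items():
            if name not in unified:
                unified[name] = kind
            elif unified[name] != kind:
                unified[name] = "string"
    return unified


# Hypothetical file contents standing in for "file 1 ... file N".
raw_files = [
    '{"id": 1, "name": "a", "tags": ["x"]}',
    '{"id": 2, "active": true, "addr": {"city": "Rome"}}',
    '{"id": "3", "name": "c"}',
]

per_file = [infer_schema(json.loads(text)) for text in raw_files]
unified = unify(per_file)
print(unified)
```

Here `id` appears as both a number and a string across files, so the sketch falls back to `string` for it; a real crawler may resolve such conflicts differently.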

What can Crawlers Classify?
IAM Role
Glue Crawler
Data Lakes: Amazon S3 (Object Connection)
Data Warehouse: Amazon Redshift (JDBC Connection)
Databases: Amazon RDS (JDBC Connection)
Built-In Classifiers
MySQL, MariaDB, PostgreSQL, Aurora, Redshift
Avro, Parquet, ORC, XML, JSON & BSON
Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others)
Delimited (comma, pipe, tab, semicolon)
Compressions (ZIP, BZIP, GZIP, LZ4, Snappy)

Detecting Schema Similarity
Schema A: root { name: str, id: num, addr { street: str, city: str, zip: num } }
Schema B: root { name: str, id: num, addr: str }
Schema similarity heuristic
• 1 point for matching name
• 1 point for matching data type
• Match when similarity index > 0.7
sim = intersection / min(A, B) = 7 / 8 = .875
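The 7 / 8 figure can be reproduced with a short Python sketch. The slide does not spell out how points are counted, so the reading used here is an assumption: every field (including `root`) is worth two points, one for its name and one for its type, and the intersection is divided by the smaller schema's total. The flattened path representation is also just a convenience for the example.

```python
from typing import Dict

Schema = Dict[str, str]  # flattened field path -> type

SCHEMA_A: Schema = {
    "root": "struct",
    "root.name": "str",
    "root.id": "num",
    "root.addr": "struct",
    "root.addr.street": "str",
    "root.addr.city": "str",
    "root.addr.zip": "num",
}
SCHEMA_B: Schema = {
    "root": "struct",
    "root.name": "str",
    "root.id": "num",
    "root.addr": "str",
}


def points(schema: Schema) -> int:
    """Maximum score of a schema: name point plus type point per field."""
    return 2 * len(schema)


def intersection(a: Schema, b: Schema) -> int:
    """One point per shared field name, one more if the types also match."""
    score = 0
    for path, kind in a.items():
        if path in b:
            score += 1
            if b[path] == kind:
                score += 1
    return score


def similarity(a: Schema, b: Schema) -> float:
    return intersection(a, b) / min(points(a), points(b))


sim = similarity(SCHEMA_A, SCHEMA_B)
print(sim, "match" if sim > 0.7 else "no match")
```

Under this reading `root`, `name`, and `id` score two points each and `addr` scores one (same name, different type), giving 7 out of the 8 points available in Schema B.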

Custom Classifiers
1. Write a custom classifier
2. Add it to your crawler

Automatically Detect Partitions
Available partitions

Automatic Schema Versioning
Automatically update table version as data evolves

Other Ways of Creating Tables
Call Glue’s CreateTable API
Create table manually
Run Hive DDL statement
Import from Apache Hive Metastore
Apache Hive Metastore, AWS Glue ETL, AWS Glue Data Catalog
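As a rough illustration of the CreateTable route, the sketch below builds a `TableInput` structure for a Parquet table and passes it to a client's `create_table` call. The table name, bucket path, columns, and the `build_table_input` helper are hypothetical; `RecordingClient` is a stand-in so the example runs without AWS access, and in practice it would be replaced by `boto3.client("glue")`.

```python
from typing import Any, Dict, List, Sequence, Tuple

Column = Tuple[str, str]  # (name, Hive type)


def build_table_input(name: str, location: str, columns: Sequence[Column],
                      partition_keys: Sequence[Column] = ()) -> Dict[str, Any]:
    """Assemble a TableInput dict for an external Parquet table."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def create_table(glue_client: Any, database: str, table_input: Dict[str, Any]) -> Any:
    """Register the table in the Data Catalog through the given client."""
    return glue_client.create_table(DatabaseName=database, TableInput=table_input)


class RecordingClient:
    """Stand-in for boto3.client("glue"); it only records the call it receives."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def create_table(self, **kwargs: Any) -> Dict[str, Any]:
        self.calls.append(kwargs)
        return {}


client = RecordingClient()
table_input = build_table_input(
    "trips",                          # hypothetical table name
    "s3://example-bucket/trips/",     # hypothetical S3 location
    [("trip_id", "bigint"), ("fare", "double")],
    partition_keys=[("year", "string")],
)
create_table(client, "sampledb", table_input)
print(client.calls[0]["TableInput"]["Name"])
```

Keeping the dictionary construction separate from the API call makes the table definition easy to inspect or unit-test before anything is sent to AWS.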

Amazon Redshift—Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at any scale: Columnar storage technology to improve I/O efficiency and scale query performance
Inexpensive: As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance
Open file formats: Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3

New Dense Compute Node - DC2
2X Performance @ Same Price as DC1
3x more I/O with 30% better storage utilization than DC1
“We saw a 9x reduction in month-end reporting time with Redshift DC2 nodes as compared to DC1”
- Bradley Todd, Technical Architect, Liberty Mutual
NVMe SSD, DDR4 Memory, Intel E5-2686 v4 (Broadwell)

Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in an S3 data lake
Exabyte Redshift SQL queries against S3
Join data across Redshift and S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
Pay only for the amount of data scanned
S3 data lake, Redshift data
Redshift Spectrum query engine

Amazon Redshift Spectrum Architecture
Massively parallel, shared nothing columnar architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, unload, backup, restore
Amazon Redshift Spectrum nodes
• Execute queries directly against Amazon S3
Diagram: SQL Clients/BI Tools connect over JDBC/ODBC to the Leader Node, which coordinates the Compute Nodes (each labeled 128GB RAM, 16TB disk, 16 cores); Amazon Redshift Spectrum nodes 1, 2, 3, 4 ... N sit between the cluster and Amazon S3, alongside Load, Unload, Backup, and Restore paths

Redshift Spectrum
Query your data lake
Amazon Redshift (JDBC/ODBC)
Redshift Spectrum: Scale-out serverless compute (1, 2, 3, 4 ... N)
Amazon S3: Exabyte-scale object storage
AWS Glue Data Catalog
Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …

Defining Data Lake Artifacts (Schema on Read)
Define an external schema in Amazon Redshift using the Data Catalog
CREATE external schema archived_trips
from data catalog database 'sampledb'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
region 'us-east-2';
View External Schemas
select * from svv_external_schemas
View External Tables
select * from svv_external_tables

Data Lake on Amazon S3 with AWS Glue
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
AMAZON QUICKSIGHT
AWS GLUE ETL

NUVIAD: Data Lake Analytics with Redshift Spectrum
Data sources, AWS Glue, Amazon S3, Amazon Redshift, Redshift Spectrum, BI Tools
“Amazon Redshift Spectrum is a game changer for us. Reports that took minutes to produce are now delivered in seconds. We like the ability to scale compute on-demand to query Petabytes of data in S3 in various open file formats.”
-- Rafi Ton, CEO, NUVIAD
NUVIAD is a marketing platform that helps media buyers optimize their mobile bidding
Use AWS for marketing campaign and bidding analytics
Scale S3 storage for unlimited data capacity
Use Spectrum for unlimited scale and query concurrency
80% performance gain using Parquet data format

Demonstration

Redshift IAM Callout.
Amazon Redshift needs authorization to access the Data Catalog in AWS
Glue and the data files in Amazon S3 on your behalf.
To provide that authorization, you first create an AWS Identity and Access
Management (IAM) role.
Then you attach the role to your cluster and provide the Amazon Resource Name
(ARN) for the role in the Amazon Redshift CREATE EXTERNAL SCHEMA statement.
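The two artifacts this callout refers to can be sketched in Python: a trust policy that lets the Redshift service assume the role, and the CREATE EXTERNAL SCHEMA statement that names the role's ARN. The helper names are hypothetical, the schema, database, ARN, and region values are the sample values from the earlier slide, and the permissions policies the role also needs (for S3 and the Data Catalog) are not shown.

```python
import json


def redshift_trust_policy() -> str:
    """Trust policy document allowing Amazon Redshift to assume the role."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "redshift.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    return json.dumps(policy, indent=2)


def external_schema_sql(schema: str, catalog_db: str, role_arn: str, region: str) -> str:
    """Render the DDL; plain string formatting, so only use trusted inputs."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{role_arn}'\n"
        f"REGION '{region}';"
    )


trust = redshift_trust_policy()
ddl = external_schema_sql(
    "archived_trips",
    "sampledb",
    "arn:aws:iam::123456789012:role/MySpectrumRole",
    "us-east-2",
)
print(trust)
print(ddl)
```

The role must be attached to the cluster before the statement is run; that step happens in the Redshift console or API and is outside this sketch.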

Recommendations – 1 of 4
Use Amazon Redshift Spectrum to improve scan-intensive
concurrent workloads
Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers
that are independent of your cluster.
It pushes many compute-intensive tasks, such as predicate filtering and
aggregation, down to the Amazon Redshift Spectrum layer, so queries use
much less of your cluster’s processing capacity.

Query Optimize your Data Lake - Use Apache Parquet
Pay special attention to two interesting metrics:
• s3_scanned_rows
• s3query_returned_rows
You will notice a tremendous reduction in the amount of data returned
from Amazon Redshift Spectrum to Amazon Redshift for final processing
when using Parquet compared with CSV files.

Query Optimize your Data Lake - Partition Parquet files
Use the following SQL to analyze the effectiveness of partition pruning.
If the query touches only a few partitions, you can verify that everything
behaves as expected:
SELECT query, segment,
  max(assigned_partitions) AS total_partitions,
  max(qualified_partitions) AS qualified_partitions
FROM svl_s3partition
WHERE query=<Query-ID>
GROUP BY 1,2;
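To make the total-versus-qualified comparison concrete, here is a small Python sketch of partition pruning over Hive-style `name=value` folder segments. The object keys, the `parse_partition` and `prune` helpers, and the equality-only filter are assumptions for illustration; Redshift Spectrum's own pruning is internal and is only observed through the view queried above.

```python
from typing import Dict, Iterable, List, Tuple


def parse_partition(key: str) -> Dict[str, str]:
    """Return the name=value folder segments found in an object key."""
    spec: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name:
            spec[name] = value
    return spec


def prune(keys: Iterable[str], wanted: Dict[str, str]) -> Tuple[int, int, List[str]]:
    """Return (total_partitions, qualified_partitions, keys_to_scan)."""
    by_partition: Dict[Tuple[Tuple[str, str], ...], List[str]] = {}
    for key in keys:
        spec = tuple(sorted(parse_partition(key).items()))
        by_partition.setdefault(spec, []).append(key)

    qualified = 0
    keys_to_scan: List[str] = []
    for spec, members in by_partition.items():
        values = dict(spec)
        # A partition qualifies only if every filter column matches.
        if all(values.get(name) == value for name, value in wanted.items()):
            qualified += 1
            keys_to_scan.extend(members)
    return len(by_partition), qualified, keys_to_scan


# Hypothetical object keys laid out as year=/month= folders.
KEYS = [
    "trips/year=2018/month=02/part-0.parquet",
    "trips/year=2018/month=03/part-0.parquet",
    "trips/year=2018/month=04/part-0.parquet",
    "trips/year=2018/month=04/part-1.parquet",
]

total, qualified, to_scan = prune(KEYS, {"year": "2018", "month": "04"})
print(f"{qualified} of {total} partitions qualify; {len(to_scan)} files to scan")
```

A filter on the partition columns leaves one of three partitions to read, which is the kind of gap between total and qualified partitions the SQL check is meant to reveal.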

Improve Data Lake query performance with predicate pushdown
There are certain SQL operations that can be pushed down to the Amazon Redshift Spectrum layer.
You want to take advantage of these wherever possible. For example:
• GROUP BY clauses and some string functions
• Equal predicates and pattern-matching conditions such as LIKE
• Common aggregate functions such as COUNT, SUM, AVG, MIN, MAX, and many others
• Functions such as regex_replace and many others
Certain SQL operations like DISTINCT and ORDER BY must be performed in Amazon Redshift
because they can’t be pushed down to Amazon Redshift Spectrum. If possible, you should minimize
or avoid using them.
NOTE: Where possible, replace DISTINCT with GROUP BY, which can be pushed down.
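The DISTINCT-to-GROUP BY rewrite returns the same rows, which the following Python sketch checks using the standard-library `sqlite3` module. SQLite is only a local stand-in here: the example demonstrates that the two query forms are equivalent, not how Redshift Spectrum pushes work down, and the `trips` table and its rows are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (vendor TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("A", 10.0), ("B", 7.5), ("A", 12.0), ("C", 3.0), ("B", 7.5)],
)

# Original form: DISTINCT (per the slide, not pushed down to Spectrum).
distinct_rows = conn.execute("SELECT DISTINCT vendor FROM trips").fetchall()

# Rewritten form: GROUP BY on the same column.
grouped_rows = conn.execute("SELECT vendor FROM trips GROUP BY vendor").fetchall()

# Compare as sorted lists because neither query guarantees an order.
same = sorted(distinct_rows) == sorted(grouped_rows)
print(same)
conn.close()
```

The same rewrite extends to multiple columns by listing each selected column in the GROUP BY clause.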

Data Lakes on AWS
Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams
Amazon S3, AWS Glue
Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility -
https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf
Redshift, EMR, Athena, Kinesis, Elasticsearch Service

Thank you!
