Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canberra

© 2017, Amazon Web Services, Inc. or its Affiliates, All rights reserved.
Best Practices for Data Warehousing with Amazon Redshift
Ganesh Raja
Specialist Solutions Architect Data & Analytics
Wednesday, 30th August 2017

Deep Dive Overview
• Amazon Redshift history and development
• Cluster architecture
• Concepts and terminology
• Storage deep dive
• New & upcoming features

Amazon Redshift History & Development

[Diagram: Amazon Redshift shown alongside PostgreSQL, Columnar, MPP, and OLAP, and the AWS services IAM, Amazon VPC, Amazon SWF, Amazon S3, AWS KMS, Amazon Route 53, Amazon CloudWatch, and Amazon EC2]

February 2013 to June 2017
• > 100 significant patches
• > 150 significant features

Amazon Redshift Cluster Architecture

Massively parallel, shared nothing
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, backup, restore
[Diagram: SQL clients/BI tools connect to the leader node over JDBC/ODBC. The leader node and three compute nodes (each labelled 128GB RAM, 16TB disk, 16 cores) are interconnected over 10 GigE (HPC). Ingestion, backup, and restore run between the compute nodes and S3 / EMR / DynamoDB / SSH.]

Compute & Leader Node Components

Leader Node
• Parser & rewriter
• Planner & optimizer
• Code generator
o Input: optimized plan
o Output: >=1 C++ functions
• Compiler
• Task scheduler
• WLM
o Admission
o Scheduling
• PostgreSQL catalog tables

Compute Node
• Query execution processes
• Backup & restore processes
• Replication processes
• Local storage
o Disks
o Slices
o Tables
o Columns
o Blocks

Designed for I/O Reduction
• Columnar storage
• Data compression
• Zone maps
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Accessing dt with row storage:
o Need to read everything
o Unnecessary I/O

Columnar storage
• Accessing dt with columnar storage:
o Only scan blocks for the relevant column
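
The difference between reading dt from row storage and from columnar storage can be sketched with a toy model. This is only an illustration and not Redshift's storage engine: the "cost" is simply the number of values touched, and the function names are made up for this example.

```python
# Toy model (not Redshift internals): the deep_dive table stored two ways.
rows = [
    (1, "SFO", "2016-09-01"),
    (2, "JFK", "2016-09-14"),
    (3, "SFO", "2017-04-01"),
    (4, "JFK", "2017-05-14"),
]

def read_dt_row_store(row_store):
    """Row storage: every record is read whole, even though only dt is needed."""
    values_read = 0
    dts = []
    for record in row_store:
        values_read += len(record)  # aid, loc and dt all come off disk
        dts.append(record[2])
    return dts, values_read

# Columnar storage: each column is kept as its own sequence of values.
columns = {
    "aid": [r[0] for r in rows],
    "loc": [r[1] for r in rows],
    "dt": [r[2] for r in rows],
}

def read_dt_column_store(column_store):
    """Columnar storage: only the dt column is touched."""
    dts = list(column_store["dt"])
    return dts, len(dts)

row_dts, row_cost = read_dt_row_store(rows)
col_dts, col_cost = read_dt_column_store(columns)
print(row_cost, col_cost)  # 12 values touched versus 4
```

Both layouts return the same dt values; the columnar read simply touches a third of the data for this three-column table.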

Data compression
CREATE TABLE deep_dive (
aid INT ENCODE LZO
,loc CHAR(3) ENCODE BYTEDICT
,dt DATE ENCODE RUNLENGTH
);
• Columns grow and shrink independently
• Reduces storage requirements
• Reduces I/O
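
To give a feel for why an encoding such as RUNLENGTH reduces storage on a column with repeated consecutive values, here is a minimal run-length encoding sketch. It illustrates the general technique only; Redshift's on-disk format is not described in this deck, and the sample dt column below is hypothetical (it repeats values more than the four-row table does).

```python
from itertools import groupby

def rle_encode(values):
    """Collapse each run of equal consecutive values into (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    decoded = []
    for value, count in pairs:
        decoded.extend([value] * count)
    return decoded

# Hypothetical dt column, sorted so that equal dates sit next to each other.
dt_column = ["2016-09-01"] * 3 + ["2016-09-14"] * 2 + ["2017-04-01"]

encoded = rle_encode(dt_column)
print(encoded)
# [('2016-09-01', 3), ('2016-09-14', 2), ('2017-04-01', 1)]
```

Six stored values become three pairs, and decoding restores the column exactly. The benefit depends on runs existing, which is why this kind of encoding suits sorted or low-cardinality columns.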

Zone maps
• In-memory block metadata
• Contains per-block MIN and MAX value
• Effectively prunes blocks which cannot contain data for a given query
• Eliminates unnecessary I/O
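
The pruning idea can be sketched as follows. This is a toy model under stated assumptions: the Block class and scan_equal function are invented for illustration, and the blocks hold two values each instead of the real block size.

```python
from dataclasses import dataclass

@dataclass
class Block:
    """A toy block: its values plus the MIN/MAX pair a zone map would hold."""
    values: list

    def __post_init__(self):
        self.min_value = min(self.values)
        self.max_value = max(self.values)

def scan_equal(blocks, target):
    """Return matching values and how many blocks actually had to be read."""
    matches = []
    blocks_read = 0
    for block in blocks:
        # Zone map check: skip blocks whose range cannot contain the target.
        if target < block.min_value or target > block.max_value:
            continue
        blocks_read += 1
        matches.extend(v for v in block.values if v == target)
    return matches, blocks_read

# The dt column of deep_dive, split into two hypothetical blocks.
# ISO dates compare correctly as strings.
blocks = [
    Block(["2016-09-01", "2016-09-14"]),
    Block(["2017-04-01", "2017-05-14"]),
]

matches, blocks_read = scan_equal(blocks, "2017-04-01")
print(matches, blocks_read)  # ['2017-04-01'] 1
```

Only one of the two blocks is read. Pruning works best when the column is sorted, because sorted data gives each block a narrow, non-overlapping MIN/MAX range.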

Terminology and Concepts: Slices
A slice can be thought of as a “virtual compute node”
• Unit of data partitioning
• Parallel query processing
Facts about slices:
• Each compute node has either 2, 16, or 32 slices
• Table rows are distributed to slices
• A slice processes only its own data
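
A rough sketch of what "a slice processes only its own data" means for a grouped count: each slice aggregates its local rows and the partial results are merged afterwards. The two-slice layout, the round-robin placement, and the function name are assumptions made for illustration.

```python
from collections import Counter

rows = [(1, "SFO"), (2, "JFK"), (3, "SFO"), (4, "JFK")]
NUM_SLICES = 2  # hypothetical; real node types have 2, 16, or 32 slices per node

# Distribute rows to slices (round-robin here, purely for the sketch).
slices = [[] for _ in range(NUM_SLICES)]
for i, row in enumerate(rows):
    slices[i % NUM_SLICES].append(row)

def slice_count_by_loc(slice_rows):
    """Work done independently on one slice, over that slice's rows only."""
    return Counter(loc for _, loc in slice_rows)

# Each slice produces a partial result; the partials are then merged.
partials = [slice_count_by_loc(s) for s in slices]
total = sum(partials, Counter())
print(dict(total))
```

No slice ever looks at another slice's rows, which is what allows the per-slice work to run in parallel.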

Terminology and Concepts: Data Distribution
KEY: use when
• The key produces an even distribution of data
• Joins are performed between large fact/dimension tables
• Optimizing merge joins and GROUP BY
ALL: use for
• Small and medium size dimension tables (< 2-3M rows)
EVEN: use
• When no key can produce an even distribution
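
The trade-off between KEY and EVEN can be sketched with a toy distributor. Everything here is an assumption for illustration: the four-slice layout, the function names, and the use of crc32 as a stand-in hash (the deck does not describe the actual hash function).

```python
import zlib

NUM_SLICES = 4  # hypothetical slice count for this sketch

def distribute_key(rows, key_index):
    """KEY style: rows with the same key value always land on the same slice."""
    slices = [[] for _ in range(NUM_SLICES)]
    for row in rows:
        # crc32 is only a stand-in for whatever hash the database uses.
        h = zlib.crc32(str(row[key_index]).encode("utf-8"))
        slices[h % NUM_SLICES].append(row)
    return slices

def distribute_even(rows):
    """EVEN style: round-robin placement, ignoring row contents."""
    slices = [[] for _ in range(NUM_SLICES)]
    for i, row in enumerate(rows):
        slices[i % NUM_SLICES].append(row)
    return slices

# Eight hypothetical rows of (aid, loc) with only two distinct loc values.
rows = [(aid, "SFO" if aid % 2 else "JFK") for aid in range(1, 9)]

by_loc = distribute_key(rows, key_index=1)
even = distribute_even(rows)

print([len(s) for s in by_loc])  # skewed: at most two slices hold any rows
print([len(s) for s in even])    # balanced across all four slices
```

With only two distinct loc values, a KEY distribution on loc can fill at most two of the four slices, which is the skew that EVEN avoids. The benefit of KEY is co-location: rows sharing a key value sit on the same slice, so a join on that key needs no data movement.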

Storage Deep Dive: Disks
• Amazon Redshift uses locally attached storage devices
• Compute nodes have 2.5-3x the advertised storage capacity
• 1, 3, 8, or 24 disks depending on node type
• Each disk is split into two partitions:
o Local data storage, accessed by the local compute node (CN)
o Mirrored data, accessed by a remote compute node
• Partitions are raw devices
• Local storage devices are ephemeral in nature
• Tolerant to multiple disk failures on a single node

Storage Deep Dive: Blocks
Column data is persisted to 1 MB immutable blocks
Each block contains in-memory metadata:
• Zone Maps (MIN/MAX value)
• Location of previous/next block
• Blocks are individually compressed with 1 of 11 encodings
A full block contains between 16 and 8.4 million values
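
One plausible way to arrive at the 16 and 8.4 million bounds is simple arithmetic on the 1 MB block size. The deck gives no derivation, so the two assumptions below (the largest value being a 65,535-byte VARCHAR, and the smallest compressed value occupying a single bit) are this sketch's own reading.

```python
BLOCK_BYTES = 1024 * 1024   # 1 MB block
MAX_VARCHAR_BYTES = 65535   # assumed worst case: maximum-width VARCHAR values
BITS_PER_BYTE = 8

# Fewest values per full block: every value is as wide as possible.
min_values = BLOCK_BYTES // MAX_VARCHAR_BYTES

# Most values per full block: assume each value compresses down to one bit.
max_values = BLOCK_BYTES * BITS_PER_BYTE

print(min_values)  # 16
print(max_values)  # 8388608, roughly 8.4 million
```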

Storage Deep Dive: Columns
• Column: Logical structure accessible via SQL
• Column properties include:
• Distribution Key
• Sort Key
• Compression Encoding
• Columns shrink and grow independently, 1 block at a time
• Three system columns per table, per slice, for MVCC

Enter Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
• Fast @ exabyte scale
• Elastic & highly available
• On-demand, pay-per-query
• High concurrency: Multiple clusters access same data
• No ETL: Query data in-place using open file formats
• Full Amazon Redshift SQL support

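Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
[Diagram: the query reaches Amazon Redshift over JDBC/ODBC and is spread across a fleet of nodes (1, 2, 3, 4, ... N) that read from Amazon S3 (exabyte-scale object storage), with table definitions held in the Data Catalog / Apache Hive Metastore.]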
Paradigm Shift Enabled by Redshift Spectrum
Traditional Approach: Analyze Subsets of Data
• Had to pick and choose which data you wanted to analyze
• Analyze only the data that fits in your data warehouse
Redshift Spectrum Approach: Analyze ALL Available Data
• Analyze any of the data in your data lake

Recently Released Features
Performance Enhancements
• Vacuum (10x faster for deletes)
• Snapshot Restore (2x faster)
• Queries (Up to 5x faster)
QMR - Query Monitoring Rules
• Apply rules to in-flight queries
Enhanced VPC Routing
• Restrict S3 Bucket Access

Recently Released: IAM Authentication
• New Amazon Redshift ODBC/JDBC drivers: grab the ticket (userid) and get a SAML assertion
[Diagram: on the client side, BI tools, SQL clients, and analytics tools connect over ODBC/JDBC to Amazon Redshift on AWS. Single Sign-On goes through identity providers (ADFS, Corporate Active Directory) and IAM, covering user groups and individual users.]

Coming Soon: Lots More…
Automatic and Incremental Background VACUUM
• Reclaims space and sorts when Amazon Redshift clusters are idle
• Vacuum is initiated when performance can be enhanced
• Improves ETL and query performance
Short Query Bias
• Prioritize interactive short-running queries

University of Technology Sydney
Our Journey with Amazon Redshift

University of Technology Sydney
Our technical people need an environment that has:
• High performance
• Scalable
• Ability to clone environments quickly and easily
• Auto upgrades – no need to plan for upgrades of the database
• Proactive support
• Reliable
• Low technical barrier

University of Technology Sydney
Examples of many use cases across the university where Amazon Redshift has enabled us to meet the needs of the University
• High performance queries
• Make data available for analytics and data discovery
• Ability to run queries against large data sets
• As scalable as needed

What does this mean for us?
Utilising familiar tools such as:
• IBM Cognos BI
• IBM Cognos TM1
• Microsoft Power BI
• SPSS and R
We now have a data platform allowing analytics and innovation in the cloud.
Our project delivery is no longer limited by technical capability. It is now limited only by our workforce capacity.
