SlideShare uma empresa Scribd logo
1 de 34
© 2017, Amazon Web Services, Inc. or its Affiliates, All rights reserved.
Best Practices for Data Warehousing
with Amazon Redshift
Ganesh Raja
Specialist Solutions Architect Data & Analytics
Wednesday, 30th August 2017
Deep Dive Overview
• Amazon Redshift history and development
• Cluster architecture
• Concepts and terminology
• Storage deep dive
• New & upcoming features
Amazon Redshift History & Development
Columnar
MPP
OLAP
IAMAmazon
VPC
Amazon SWF
Amazon S3 AWS KMS Amazon
Route 53
Amazon
CloudWatch
Amazon
EC2
PostgreSQL
Amazon Redshift
February 2013
June 2017
> 100 Significant Patches
> 150 Significant Features
Amazon Redshift Cluster Architecture
Amazon Redshift Cluster Architecture
Massively parallel, shared nothing
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, backup, restore
10 GigE
(HPC)
Ingestion
Backup
Restore
SQL Clients/BI Tools
128GB RAM
16TB disk
16 cores
S3 / EMR / DynamoDB / SSH
JDBC/ODBC
128GB RAM
16TB disk
16 coresCompute
Node
128GB RAM
16TB disk
16 coresCompute
Node
128GB RAM
16TB disk
16 coresCompute
Node
Leader
Node
Compute & Leader Node Components
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
Compute Node
128GB RAM
16TB disk
16 cores
Compute Node
128GB RAM
16TB disk
16 cores
Compute Node
Leader Node
• Parser & rewriter
• Planner & optimizer
• Code generator
• Input: optimized plan
• Output: >=1 C++
functions
• Compiler
• Task scheduler
• WLM
• Admission
• Scheduling
• PostgreSQL catalog tables
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
Compute Node
128GB RAM
16TB disk
16 cores
Compute Node
128GB RAM
16TB disk
16 cores
Compute Node
Leader Node
• Query execution processes
• Backup & restore processes
• Replication processes
• Local Storage
• Disks
• Slices
• Tables
• Columns
• Blocks
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
128GB RAM
16TB disk
16 cores
Leader Node
Compute NodeCompute Node Compute Node
Concepts & Terminology
Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
aid loc dt
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Accessing dt with row storage:
o Need to read everything
o Unnecessary I/O
Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
aid loc dt
Designed for I/O Reduction
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Accessing dt with columnar storage
o Only scan blocks for relevant column
Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
aid loc dt
CREATE TABLE deep_dive (
aid INT ENCODE LZO
,loc CHAR(3) ENCODE BYTEDICT
,dt DATE ENCODE RUNLENGTH
);
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
• Columns grow and shrink independently
• Reduces storage requirements
• Reduces I/O
Designed for I/O Reduction
Columnar storage
Data compression
Zone maps
aid loc dt
1 SFO 2016-09-01
2 JFK 2016-09-14
3 SFO 2017-04-01
4 JFK 2017-05-14
aid loc dt
CREATE TABLE deep_dive (
aid INT --audience_id
,loc CHAR(3) --location
,dt DATE --date
);
• In-memory block metadata
• Contains per-block MIN and MAX value
• Effectively prunes blocks which cannot
contain data for a given query
• Eliminates unnecessary I/O
Terminology and Concepts: Slices
A slice can be thought of like a “virtual compute node”
• Unit of data partitioning
• Parallel query processing
Facts about slices:
• Each compute node has either 2, 16, or 32 slices
• Table rows are distributed to slices
• A slice processes only its own data
Terminology and Concepts: Data Distribution
KEY
• The key creates an even distribution of data
• Joins are performed between large fact/dimension tables
• Optimizing merge joins and group by
ALL
• Small and medium size dimension tables (< 2-3M)
EVEN
• When key cannot produce an even distribution
Storage Deep Dive
Storage Deep Dive: Disks
• Amazon Redshift uses locally attached storage
devices
• Compute nodes have 2.5-3x the advertised storage capacity
• 1, 3, 8, or 24 disks depending on node type
• Each disk is split into two partitions
• Local data storage, accessed by local CN
• Mirrored data, accessed by remote CN
• Partitions are raw devices
• Local storage devices are ephemeral in nature
• Tolerant to multiple disk failures on a single node
Storage Deep Dive: Blocks
Column data is persisted to 1 MB immutable blocks
Each block contains in-memory metadata:
• Zone Maps (MIN/MAX value)
• Location of previous/next block
• Blocks are individually compressed with 1 of 11 encodings
A full block contains between 16 and 8.4 million values
Storage Deep Dive: Columns
• Column: Logical structure accessible via SQL
• Column properties include:
• Distribution Key
• Sort Key
• Compression Encoding
• Columns shrink and grow independently, 1 block at a time
• Three system columns per table-per slice for MVCC
New & Upcoming Features
Fast @ exabyte scale Elastic & highly available On-demand, pay-per-query
High concurrency:
Multiple clusters access
same data
No ETL: Query data in-
place using open file
formats
Full Amazon Redshift
SQL support
S3
SQL
Enter Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using
thousands of nodes
Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
Amazon
Redshift
JDBC/ODBC
...
1 2 3 4 N
Amazon S3
Exabyte-scale object storage
Data Catalog
Apache Hive Metastore
26
Analyze Subsets of Data Analyze ALL Available Data
Traditional Approach Redshift Spectrum Approach
Had to pick and choose which data you wanted to analyze
Analyze only the data that fits
in your data warehouse
Analyze any of the data in your
data lake
Paradigm Shift Enabled by Redshift Spectrum
Recently Released Features
Performance Enhancements
• Vacuum (10x faster for deletes)
• Snapshot Restore (2x faster)
• Queries (Up to 5x faster)
QMR - Query Monitoring Rules
• Apply rules to inflight queries
Enhanced VPC Routing
• Restrict S3 Bucket Access
BI tools SQL clientsAnalytics tools
Client AWS
Amazon
Redshift
ADFS
Corporate
Active Directory IAM
Amazon Redshift
ODBC/JDBC
User groups Individual user
Single Sign-On
Identity providers
New Amazon
Redshift
ODBC/JDBC
drivers. Grab the
ticket (userid) and
get a SAML
assertion.
Recently Released: IAM Authentication
Automatic and Incremental Background VACUUM
• Reclaims space and sorts when Amazon Redshift clusters are idle
• Vacuum is initiated when performance can be enhanced
• Improves ETL and query performance
Short Query Bias
• Prioritize interactive short running queries
Coming Soon: Lots More…
University of Technology
Sydney
Our Journey with Amazon Redshift
Graphicscreatedby
Ulluptaconecumrevolupta
evelignispeetdoluptuam.
31
University of Technology Sydney
• High Performance
• Scalable
• Ability to clone environments quickly and
easily
• Auto upgrades – no need to plan for
upgrades of the database
• Proactive support
• Reliable
• Low technical barrier
Our technical people need a environment that has
32
University of Technology Sydney
Examples of many use cases across the university where Amazon Redshift has enabled us
to meet the needs of the University
• High performance queries
• Make data available for analytics
and data discovery
• Ability to run queries against large
data sets
• As scalable as needed
33
What does this means for us?
Utilising familiar tools such as:
• IBM Cognos BI
• IBM Cognos TM1
• Microsoft Power BI
• SPSS and R
We now have a data platform allowing analytics and innovation in the cloud
Our project delivery is no longer limited to technical
capability. It is now only limited by our workforce capacity.
© 2017, Amazon Web Services, Inc. or its Affiliates, All rights reserved.
Thank you!

Mais conteúdo relacionado

Mais procurados

RocksDB detail
RocksDB detailRocksDB detail
RocksDB detailMIJIN AN
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfEric Xiao
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure DataTaro L. Saito
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?lucenerevolution
 
The Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialThe Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialJean-François Gagné
 
Redis Introduction
Redis IntroductionRedis Introduction
Redis IntroductionAlex Su
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance SmackdownDataWorks Summit
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersCloudera, Inc.
 
Backup, Restore, and Disaster Recovery
Backup, Restore, and Disaster RecoveryBackup, Restore, and Disaster Recovery
Backup, Restore, and Disaster RecoveryMongoDB
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumTathastu.ai
 
Redis overview for Software Architecture Forum
Redis overview for Software Architecture ForumRedis overview for Software Architecture Forum
Redis overview for Software Architecture ForumChristopher Spring
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance ImprovementBiju Nair
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponCloudera, Inc.
 

Mais procurados (20)

Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
RocksDB detail
RocksDB detailRocksDB detail
RocksDB detail
 
Write Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdfWrite Faster SQL with Trino.pdf
Write Faster SQL with Trino.pdf
 
Presto At Treasure Data
Presto At Treasure DataPresto At Treasure Data
Presto At Treasure Data
 
What is in a Lucene index?
What is in a Lucene index?What is in a Lucene index?
What is in a Lucene index?
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
The Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication TutorialThe Full MySQL and MariaDB Parallel Replication Tutorial
The Full MySQL and MariaDB Parallel Replication Tutorial
 
Redis Introduction
Redis IntroductionRedis Introduction
Redis Introduction
 
Redis introduction
Redis introductionRedis introduction
Redis introduction
 
Hadoop Query Performance Smackdown
Hadoop Query Performance SmackdownHadoop Query Performance Smackdown
Hadoop Query Performance Smackdown
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
 
Backup, Restore, and Disaster Recovery
Backup, Restore, and Disaster RecoveryBackup, Restore, and Disaster Recovery
Backup, Restore, and Disaster Recovery
 
Building robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and DebeziumBuilding robust CDC pipeline with Apache Hudi and Debezium
Building robust CDC pipeline with Apache Hudi and Debezium
 
Redis overview for Software Architecture Forum
Redis overview for Software Architecture ForumRedis overview for Software Architecture Forum
Redis overview for Software Architecture Forum
 
HBase Application Performance Improvement
HBase Application Performance ImprovementHBase Application Performance Improvement
HBase Application Performance Improvement
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUponHBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
HBaseCon 2012 | Lessons learned from OpenTSDB - Benoit Sigoure, StumbleUpon
 
Presto
PrestoPresto
Presto
 

Semelhante a Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canberra

AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features Amazon Web Services
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon RedshiftUses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon RedshiftAmazon Web Services
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Amazon Web Services
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftAmazon Web Services
 
Data & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftData & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftAmazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Amazon Web Services
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...Amazon Web Services
 
AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data AnalyticsAWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data AnalyticsAmazon Web Services
 
SRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon RedshiftSRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon RedshiftAmazon Web Services
 
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)Amazon Web Services
 
Design, Deploy, and Optimize SQL Server on AWS - AWS Online Tech Talks
Design, Deploy, and Optimize SQL Server on AWS - AWS Online Tech TalksDesign, Deploy, and Optimize SQL Server on AWS - AWS Online Tech Talks
Design, Deploy, and Optimize SQL Server on AWS - AWS Online Tech TalksAmazon Web Services
 
Design, Deploy, and Optimize SQL Server on AWS - June 2017 AWS Online Tech Talks
Design, Deploy, and Optimize SQL Server on AWS - June 2017 AWS Online Tech TalksDesign, Deploy, and Optimize SQL Server on AWS - June 2017 AWS Online Tech Talks
Design, Deploy, and Optimize SQL Server on AWS - June 2017 AWS Online Tech TalksAmazon Web Services
 

Semelhante a Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canberra (20)

AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon RedshiftUses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Building Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon RedshiftBuilding Your Data Warehouse with Amazon Redshift
Building Your Data Warehouse with Amazon Redshift
 
Redshift overview
Redshift overviewRedshift overview
Redshift overview
 
Data & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon RedshiftData & Analytics - Session 2 - Introducing Amazon Redshift
Data & Analytics - Session 2 - Introducing Amazon Redshift
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
Best Practices for Data Warehousing with Amazon Redshift | AWS Public Sector ...
 
AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data AnalyticsAWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
 
SRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon RedshiftSRV405 Deep Dive on Amazon Redshift
SRV405 Deep Dive on Amazon Redshift
 
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
AWS Summit London 2014 | Uses and Best Practices for Amazon Redshift (200)
 
Design, Deploy, and Optimize SQL Server on AWS - AWS Online Tech Talks
Design, Deploy, and Optimize SQL Server on AWS - AWS Online Tech TalksDesign, Deploy, and Optimize SQL Server on AWS - AWS Online Tech Talks
Design, Deploy, and Optimize SQL Server on AWS - AWS Online Tech Talks
 
Design, Deploy, and Optimize SQL Server on AWS - June 2017 AWS Online Tech Talks
Design, Deploy, and Optimize SQL Server on AWS - June 2017 AWS Online Tech TalksDesign, Deploy, and Optimize SQL Server on AWS - June 2017 AWS Online Tech Talks
Design, Deploy, and Optimize SQL Server on AWS - June 2017 AWS Online Tech Talks
 

Mais de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Best practices for Data warehousing with Amazon Redshift - AWS PS Summit Canberra

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates, All rights reserved. Best Practices for Data Warehousing with Amazon Redshift Ganesh Raja Specialist Solutions Architect Data & Analytics Wednesday, 30th August 2017
  • 2. Deep Dive Overview • Amazon Redshift history and development • Cluster architecture • Concepts and terminology • Storage deep dive • New & upcoming features
  • 3. Amazon Redshift History & Development
  • 4. Columnar MPP OLAP IAMAmazon VPC Amazon SWF Amazon S3 AWS KMS Amazon Route 53 Amazon CloudWatch Amazon EC2 PostgreSQL Amazon Redshift
  • 5. February 2013 June 2017 > 100 Significant Patches > 150 Significant Features
  • 6. Amazon Redshift Cluster Architecture
  • 7. Amazon Redshift Cluster Architecture Massively parallel, shared nothing Leader node • SQL endpoint • Stores metadata • Coordinates parallel SQL processing Compute nodes • Local, columnar storage • Executes queries in parallel • Load, backup, restore 10 GigE (HPC) Ingestion Backup Restore SQL Clients/BI Tools 128GB RAM 16TB disk 16 cores S3 / EMR / DynamoDB / SSH JDBC/ODBC 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node 128GB RAM 16TB disk 16 coresCompute Node Leader Node
  • 8. Compute & Leader Node Components
  • 9. 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores Compute Node 128GB RAM 16TB disk 16 cores Compute Node 128GB RAM 16TB disk 16 cores Compute Node Leader Node
  • 10. • Parser & rewriter • Planner & optimizer • Code generator • Input: optimized plan • Output: >=1 C++ functions • Compiler • Task scheduler • WLM • Admission • Scheduling • PostgreSQL catalog tables 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores Compute Node 128GB RAM 16TB disk 16 cores Compute Node 128GB RAM 16TB disk 16 cores Compute Node Leader Node
  • 11. • Query execution processes • Backup & restore processes • Replication processes • Local Storage • Disks • Slices • Tables • Columns • Blocks 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores 128GB RAM 16TB disk 16 cores Leader Node Compute NodeCompute Node Compute Node
  • 13. Designed for I/O Reduction Columnar storage Data compression Zone maps aid loc dt CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ); aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14 • Accessing dt with row storage: o Need to read everything o Unnecessary I/O
  • 14. Designed for I/O Reduction Columnar storage Data compression Zone maps aid loc dt Designed for I/O Reduction CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ); aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14 • Accessing dt with columnar storage o Only scan blocks for relevant column
  • 15. Designed for I/O Reduction Columnar storage Data compression Zone maps aid loc dt CREATE TABLE deep_dive ( aid INT ENCODE LZO ,loc CHAR(3) ENCODE BYTEDICT ,dt DATE ENCODE RUNLENGTH ); aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14 • Columns grow and shrink independently • Reduces storage requirements • Reduces I/O
  • 16. Designed for I/O Reduction Columnar storage Data compression Zone maps aid loc dt 1 SFO 2016-09-01 2 JFK 2016-09-14 3 SFO 2017-04-01 4 JFK 2017-05-14 aid loc dt CREATE TABLE deep_dive ( aid INT --audience_id ,loc CHAR(3) --location ,dt DATE --date ); • In-memory block metadata • Contains per-block MIN and MAX value • Effectively prunes blocks which cannot contain data for a given query • Eliminates unnecessary I/O
  • 17. Terminology and Concepts: Slices A slice can be thought of like a “virtual compute node” • Unit of data partitioning • Parallel query processing Facts about slices: • Each compute node has either 2, 16, or 32 slices • Table rows are distributed to slices • A slice processes only its own data
  • 18. Terminology and Concepts: Data Distribution KEY • The key creates an even distribution of data • Joins are performed between large fact/dimension tables • Optimizing merge joins and group by ALL • Small and medium size dimension tables (< 2-3M) EVEN • When key cannot produce an even distribution
  • 20. Storage Deep Dive: Disks • Amazon Redshift uses locally attached storage devices • Compute nodes have 2.5-3x the advertised storage capacity • 1, 3, 8, or 24 disks depending on node type • Each disk is split into two partitions • Local data storage, accessed by local CN • Mirrored data, accessed by remote CN • Partitions are raw devices • Local storage devices are ephemeral in nature • Tolerant to multiple disk failures on a single node
  • 21. Storage Deep Dive: Blocks Column data is persisted to 1 MB immutable blocks Each block contains in-memory metadata: • Zone Maps (MIN/MAX value) • Location of previous/next block • Blocks are individually compressed with 1 of 11 encodings A full block contains between 16 and 8.4 million values
  • 22. Storage Deep Dive: Columns • Column: Logical structure accessible via SQL • Column properties include: • Distribution Key • Sort Key • Compression Encoding • Columns shrink and grow independently, 1 block at a time • Three system columns per table-per slice for MVCC
  • 23. New & Upcoming Features
  • 24. Fast @ exabyte scale Elastic & highly available On-demand, pay-per-query High concurrency: Multiple clusters access same data No ETL: Query data in- place using open file formats Full Amazon Redshift SQL support S3 SQL Enter Amazon Redshift Spectrum Run SQL queries directly against data in S3 using thousands of nodes
  • 25. Query SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY… Amazon Redshift JDBC/ODBC ... 1 2 3 4 N Amazon S3 Exabyte-scale object storage Data Catalog Apache Hive Metastore
  • 26. 26 Analyze Subsets of Data Analyze ALL Available Data Traditional Approach Redshift Spectrum Approach Had to pick and choose which data you wanted to analyze Analyze only the data that fits in your data warehouse Analyze any of the data in your data lake Paradigm Shift Enabled by Redshift Spectrum
  • 27. Recently Released Features Performance Enhancements • Vacuum (10x faster for deletes) • Snapshot Restore (2x faster) • Queries (Up to 5x faster) QMR - Query Monitoring Rules • Apply rules to inflight queries Enhanced VPC Routing • Restrict S3 Bucket Access
  • 28. BI tools SQL clientsAnalytics tools Client AWS Amazon Redshift ADFS Corporate Active Directory IAM Amazon Redshift ODBC/JDBC User groups Individual user Single Sign-On Identity providers New Amazon Redshift ODBC/JDBC drivers. Grab the ticket (userid) and get a SAML assertion. Recently Released: IAM Authentication
  • 29. Automatic and Incremental Background VACUUM • Reclaims space and sorts when Amazon Redshift clusters are idle • Vacuum is initiated when performance can be enhanced • Improves ETL and query performance Short Query Bias • Prioritize interactive short running queries Coming Soon: Lots More…
  • 30. University of Technology Sydney Our Journey with Amazon Redshift Graphicscreatedby Ulluptaconecumrevolupta evelignispeetdoluptuam.
  • 31. 31 University of Technology Sydney • High Performance • Scalable • Ability to clone environments quickly and easily • Auto upgrades – no need to plan for upgrades of the database • Proactive support • Reliable • Low technical barrier Our technical people need a environment that has
  • 32. 32 University of Technology Sydney Examples of many use cases across the university where Amazon Redshift has enabled us to meet the needs of the University • High performance queries • Make data available for analytics and data discovery • Ability to run queries against large data sets • As scalable as needed
  • 33. 33 What does this means for us? Utilising familiar tools such as: • IBM Cognos BI • IBM Cognos TM1 • Microsoft Power BI • SPSS and R We now have a data platform allowing analytics and innovation in the cloud Our project delivery is no longer limited to technical capability. It is now only limited by our workforce capacity.
  • 34. © 2017, Amazon Web Services, Inc. or its Affiliates, All rights reserved. Thank you!