Best Practices for Migrating your Data Warehouse to Amazon Redshift
1. Best Practices for Migrating your Data Warehouse to Amazon Redshift
Vikram Gangulavoipalyam, Enterprise Solutions Architect
June 13, 2017
2. Migrating your Data Warehouse Overview
• Why Migrate
• Customer Success Stories
• Amazon Redshift History and Development
• Cluster Architecture
• Migration Best Practices
• Migration Tools
• Open Q&A
5. Why Migrate to Amazon Redshift?
• vs. transactional databases: 100x faster; scales from GBs to PBs; analyze data without storage constraints
• vs. MPP databases: 10x cheaper; easy to provision and operate; higher productivity
• vs. Hadoop: 10x faster; no programming; standard interfaces and integration to leverage BI tools, machine learning, streaming
9. Migration from Greenplum @ NTT Docomo
• 68 million customers
• 10s of TBs of data per day across the mobile network
• 6 PB of total data (uncompressed)
• Data science for marketing operations, logistics, etc.
Legacy DW: Greenplum on-premises
After migration:
• 125-node DS2.8XL cluster (4,500 vCPUs, 30 TB RAM)
• 6 PB uncompressed
• 10x faster analytic queries
• 50% reduction in time to deploy new BI apps
• Significantly less operational overhead
10. Migration from SQL on Hadoop @ Yahoo
• Analytics for website/mobile events across multiple Yahoo properties
• On an average day: 2B events, 25M devices
Before migration: Hive, which was found to be slow, hard to use, share, and repeat
After migration:
• 21-node DC1.8XL (SSD) cluster
• 50 TB compressed data
• 100x performance improvement
• Real-time insights
• Easier deployment and maintenance
11. Migration from SQL on Hadoop @ Yahoo
[Chart: query runtimes in seconds, log scale 1–10,000, for count distinct devices, count all events, filter clauses, and joins, comparing Amazon Redshift with Impala]
13. How to Migrate? ENGINE X → Amazon Redshift
Schema Conversion
1. Map data types; choose compression encoding, sort keys, distribution keys; generate and apply DDL
2. Convert SQL code: ETL scripts, SQL in reports, ad hoc queries; assess gaps (stored procedures, functions)
Database Migration
3. Data migration: bulk load, capture updates
4. Schema & data transformations
14. If you forget everything else…
• Lift-and-shift is NOT an ideal approach
• Resist the temptation to reuse the existing investment in ETL and process as-is
• Redshift has unique characteristics of its own
• AWS has a rich ecosystem of solutions
• Your final solution will use other AWS services
• AWS Solutions Architects, ProServe, and Partners can help
20. Designed for I/O Reduction
• Columnar storage
• Data compression
• Zone maps

aid | loc | dt
1   | SFO | 2016-09-01
2   | JFK | 2016-09-14
3   | SFO | 2017-04-01
4   | JFK | 2017-05-14

CREATE TABLE loft_migration (
    aid INT       --audience_id
    ,loc CHAR(3)  --location
    ,dt DATE      --date
);

• Accessing dt with row storage:
– Need to read everything
– Unnecessary I/O
21. Designed for I/O Reduction
• Columnar storage
• Data compression
• Zone maps
• Accessing dt with columnar storage (same example table):
– Only scan blocks for the relevant column
22. Designed for I/O Reduction
• Columnar storage
• Data compression
• Zone maps
• Columns grow and shrink independently
• Effective compression ratios due to like data
• Reduces storage requirements
• Reduces I/O

CREATE TABLE loft_migration (
    aid INT ENCODE LZO
    ,loc CHAR(3) ENCODE BYTEDICT
    ,dt DATE ENCODE RUNLENGTH
);
23. Designed for I/O Reduction
• Columnar storage
• Data compression
• Zone maps
• In-memory block metadata
• Contains per-block MIN and MAX values
• Effectively prunes blocks that cannot contain data for a given query
• Eliminates unnecessary I/O
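The effect of zone maps is easiest to see alongside a sort key. A minimal sketch using the deck's example table (the SORTKEY clause is our addition, not from the slides):

CREATE TABLE loft_migration (
    aid INT       --audience_id
    ,loc CHAR(3)  --location
    ,dt DATE      --date
)
SORTKEY (dt);

-- Sorting by dt keeps each block's MIN/MAX dt range narrow, so blocks
-- whose range cannot contain April 2017 are skipped without any I/O.
SELECT loc, COUNT(*)
FROM loft_migration
WHERE dt BETWEEN '2017-04-01' AND '2017-04-30'
GROUP BY loc;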
25. Terminology and Concepts: Slices
A slice can be thought of as a “virtual compute node”
– Unit of data partitioning
– Parallel query processing
Facts about slices:
– Each compute node has 2, 16, or 32 slices, depending on node type
– Table rows are distributed to slices
– A slice processes only its own data
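As a quick check on your own cluster, the slice layout can be listed from the STV_SLICES system view:

SELECT node, slice
FROM stv_slices
ORDER BY node, slice;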
26. Terminology and Concepts: Data Distribution
• Distribution style is a table property which dictates how that table’s data is
distributed throughout the cluster:
• KEY: Value is hashed, same value goes to same location (slice)
• ALL: Full table data goes to first slice of every node
• EVEN: Round robin
• Goals:
• Distribute data evenly for parallel processing
• Minimize data movement during query processing
[Diagram: KEY, ALL, and EVEN distribution across two nodes with two slices each: KEY hashes rows to specific slices, ALL copies the full table to every node, EVEN round-robins rows across all slices]
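In DDL, the three styles look like this; a hedged sketch in which the tables and columns are hypothetical, not from the deck:

-- KEY: rows with the same customer_id hash to the same slice,
-- co-locating them for joins on that column
CREATE TABLE orders (
    order_id BIGINT
    ,customer_id BIGINT
    ,order_dt DATE
)
DISTSTYLE KEY
DISTKEY (customer_id);

-- ALL: small dimension table replicated to every node
CREATE TABLE dim_region (
    region_id INT
    ,region_name VARCHAR(64)
)
DISTSTYLE ALL;

-- EVEN: round robin, for tables with no dominant join key
CREATE TABLE raw_events (
    event_id BIGINT
    ,payload VARCHAR(4096)
)
DISTSTYLE EVEN;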
27. Data Loading Best Practices
DS2.8XL compute node (16 slices, numbered 0–15)
Ingestion throughput:
– Each slice’s query processors can load one file at a time:
• Streaming decompression
• Parse
• Distribute
• Write
– Loading a single file realizes only partial node usage, as 6.25% of slices (1 of 16) are active
28. Data Loading Best Practices Continued
• Use at least as many input files as there are slices in the cluster
• With 16 input files on a DS2.8XL compute node (slices 0–15), all slices are working, so you maximize throughput
• COPY continues to scale linearly as you add nodes
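A hedged example (bucket, prefix, and role ARN are placeholders): pointing COPY at a common S3 prefix loads all matching files in parallel, one per slice:

COPY loft_migration
FROM 's3://my-bucket/loft/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER '|'
GZIP;

-- With 16 gzipped files part_00.gz … part_15.gz behind the prefix,
-- every slice of a DS2.8XL node loads one file concurrently.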
29. Data Preparation
Export data from the source system
– CSV recommended (delimiter '|')
– Be aware of UTF-8 VARCHAR columns (UTF-8 characters can take up to 4 bytes each)
– Be aware of your NULL character (\N)
– GZIP-compress the files
– Split files (1 MB – 1 GB after gzip compression)
Useful COPY options for PoC data (example below)
– MAXERROR
– ACCEPTINVCHARS
– NULL AS
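Those options combined in a hedged sketch (paths and role ARN are placeholders):

COPY loft_migration
FROM 's3://my-bucket/export/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER '|'
GZIP
NULL AS '\N'        -- match the NULL marker your exporter writes
ACCEPTINVCHARS '^'  -- replace invalid UTF-8 characters instead of failing
MAXERROR 10;        -- tolerate a handful of bad rows during a PoC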
30. Keep Columns as Narrow as Possible
• Buffers are allocated based on declared column width
• Columns wider than needed waste memory
• Fewer rows fit into memory, increasing the likelihood of queries spilling to disk
• Check SVV_TABLE_INFO (max_varchar)
• SELECT MAX(LEN(col)) FROM table
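Both checks from the slide, written out (the column and table in the second query are examples):

-- Widest declared VARCHAR per table, from the system view
SELECT "table", max_varchar
FROM svv_table_info
ORDER BY max_varchar DESC;

-- Widest value actually present in a suspect column
SELECT MAX(LEN(comment_text)) FROM reviews;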
31. Amazon Redshift is a Data Warehouse
Optimized for batch inserts
– The time to insert a single row is roughly the same as inserting 100,000 rows
Updates are a delete + insert of the row
– Deletes only mark rows for deletion
Blocks are immutable
– Minimum space used is one block per column, per slice
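Because updates are delete + insert, bulk upserts are typically staged rather than applied row by row; a minimal sketch with hypothetical table names:

BEGIN;

-- Remove old versions of the rows that arrived in the staging table
DELETE FROM sales
USING sales_staging
WHERE sales.sale_id = sales_staging.sale_id;

-- Insert the fresh versions in one batch
INSERT INTO sales
SELECT * FROM sales_staging;

COMMIT;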
32. Column Compression
Auto compression
– Samples data automatically during COPY into an empty table
– Samples up to 100,000 rows and picks the optimal encoding
Turn off auto compression for staging tables
– Bake encodings into your DDL or use CREATE TABLE (LIKE …)
Re-run ANALYZE COMPRESSION when
– The data profile has changed
– The sort key has changed
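Sketched with the deck's running example (path and role ARN are placeholders):

-- Report recommended encodings from a sample of the loaded data
ANALYZE COMPRESSION loft_migration;

-- Staging table inherits the baked-in encodings; COMPUPDATE OFF
-- disables automatic compression analysis during the load
CREATE TABLE loft_migration_staging (LIKE loft_migration);

COPY loft_migration_staging
FROM 's3://my-bucket/loft/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER '|'
COMPUPDATE OFF;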
34. Primary/Unique/Foreign Key Constraints
Primary, unique, and foreign key constraints are NOT enforced
– If you load the same data multiple times, Amazon Redshift won’t complain
– If you declare primary keys in your DDL, the optimizer will expect the data to be unique
The Redshift optimizer uses declared constraints to pick the optimal plan
– In certain cases this can result in performance improvements
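Since declared keys are trusted but never checked, verify uniqueness yourself after loads; a sketch against the deck's example table:

-- Any rows returned here violate a declared primary key on aid
SELECT aid, COUNT(*)
FROM loft_migration
GROUP BY aid
HAVING COUNT(*) > 1;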
39. AWS Database Migration Service (AWS DMS)
• Start your first migration in a few minutes
• Sources include: Aurora, Oracle, SQL Server, MySQL, and PostgreSQL
• Bulk load and continuous replication
• Migrate a TB for $3
• Fault tolerant
44. Resources
https://github.com/awslabs/amazon-redshift-utils
https://github.com/awslabs/amazon-redshift-monitoring
https://github.com/awslabs/amazon-redshift-udfs
Admin scripts
Collection of utilities for running diagnostics on your cluster
Admin views
Collection of utilities for managing your cluster, generating schema DDL, etc.
ColumnEncodingUtility
Gives you the ability to apply optimal column encoding to an established schema
with data already loaded
Amazon Redshift Engineering’s Advanced Table Design Playbook
https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-
playbook-preamble-prerequisites-and-prioritization/
We will start with the reasons and rationale for migrating to Redshift, illustrated with some customer success stories. We will in fact get into the specific details of a couple of use cases to drive home the value of migrating to Redshift. We will then briefly touch on the genesis of Redshift as a technology and its architecture; we will not be doing a deep dive into the architecture itself. Instead, we will focus on the components that play an important role in the actual migration of data into the Redshift environment. The meat of this presentation will be migration best practices and migration tools. At the end, we will open the floor for questions.
What are the capabilities of Redshift as a DW that justify an actual migration? How do we quantify the value of undertaking such an initiative?
Our view of data warehousing and why we built Amazon Redshift.
We want to make data warehousing fast, inexpensive and easy to use for companies of all sizes
For Enterprises, benefits are: 1/10th cost, easy to provision and manage, productive DBAs
For Big data companies: 10x faster, use standard SQL-no need to write custom code, use existing BI tools
For SaaS: analysis in-line with process flows, pay as you go, grow as you need, managed availability & DR (continuous backup to S3 and asynchronous snapshots to S3 in a different region)
Nasdaq – security: ingests 5B rows/trading day; analyzes orders, trades, and quotes
Yelp – 10s of millions of ads/day, 250M mobile events/day; analyzes ad campaign performance and new feature usage
Desk.com – ingests 200K case-related events/hour and runs a user-facing portal on Redshift
Let’s look at a couple of use cases in detail.
As an example, Boingo uses multiple Amazon Kinesis streams to process data from thousands of wireless LAN controllers and ingest it into Amazon Redshift.
NTT Docomo is a major Japan-based telecom provider. They experienced performance challenges as the amount of data grew to multiple petabytes. To address the challenges, the company sought to move some of its data analysis workload to a cloud-based solution. Elasticity to support sudden spikes in traffic significantly reduced the operational overhead they had previously.
How many of you know Hive?
Mobile events were being stored on-premises in a Hadoop cluster.
Talk about the additional admin cost of commissioning and decommissioning nodes in the cluster.
The majority of the KPIs the customer was interested in involved aggregating data across multiple joins.
Evaluation – find out the characteristics of the technology and understand how its value maps to the business value required, through KPIs and metrics.
Then, perform a PoC and validate the KPIs defined.
Design a migration plan using the best practices we will discuss in the next few slides; validate the plan, execute it, and finally operationalize it.
Second, the tool can automatically rewrite code that is stored inside the database: views, stored procedures, triggers, and functions are automatically evaluated and, where possible, converted to their equivalents in the new database. Where this can't be done accurately, the tool highlights it and makes smart recommendations (with links to the docs) to guide you through the process manually. Coupled with the Database Migration Service, the Schema Conversion Tool helps you move from whatever database you're on now to a more open database in the cloud with remarkably little skilled DBA time and cost, giving you an 'out' from your current database relationship woes.
There are a few knobs that need to be tuned which are specific to Redshift's design:
Compression/encoding
Data distribution styles – the details of which we will get into in the next few slides
Sort keys – interleaved, compound
Vacuum and analyze
Constraints are defined but never enforced
Amazon Redshift is based on PostgreSQL 8.0.2
Tightly integrated with other AWS services including S3 and IAM
The initial preview beta was released in November 2012, and a full release was made available on February 15, 2013.
Continuous Development
You can use the COPY command to load data in parallel from an Amazon EMR cluster configured to write text files to the cluster's Hadoop Distributed File System (HDFS) in the form of fixed-width files, character-delimited files, CSV files, or JSON-formatted files.
Talk about COPY from S3 and DynamoDB, and the ability to connect directly to the leader node using a SQL client.
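For reference, COPY from EMR uses an emr:// path; a hedged sketch in which the cluster ID, role ARN, and output path are placeholders:

COPY sales
FROM 'emr://j-XXXXXXXXXXXX/myoutput/part-*'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
DELIMITER '|';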
CPU Trade-off
You can use zone maps to skip ranges that do not matter to your queries.
Node Size   | vCPU | ECU | RAM (GiB) | Slices Per Node | Storage Per Node | Node Range | Total Capacity
ds2.xlarge  | 4    | 13  | 31        | 2               | 2 TB HDD         | 1–32       | 64 TB
ds2.8xlarge | 36   | 119 | 244       | 16              | 16 TB HDD        | 2–128      | 2 PB
Enables loading of data into VARCHAR columns even if the data contains invalid UTF-8 characters. When ACCEPTINVCHARS is specified, COPY replaces each invalid UTF-8 character with a string of equal length consisting of the character specified by replacement_char. For example, if the replacement character is '^', an invalid three-byte character will be replaced with '^^^'.
If no compression is specified in a CREATE TABLE or ALTER TABLE statement, Amazon Redshift automatically assigns compression encoding as follows:
Columns that are defined as sort keys are assigned RAW compression.
Columns that are defined as BOOLEAN, REAL, or DOUBLE PRECISION data types are assigned RAW compression.
All other columns are assigned LZO compression.
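You can confirm what was assigned by querying the PG_TABLE_DEF catalog view (the table's schema must be on your search_path):

SELECT "column", type, encoding
FROM pg_table_def
WHERE tablename = 'loft_migration';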
- Embedded SQL conversion
- Cloud-native code optimization
- PL/SQL or T-SQL code converted to MySQL/Aurora or PostgreSQL stored procedures
Greenplum Database (version 4.3 and later)
Microsoft SQL Server (version 2008 and later)
Netezza (version 7.2 and later)
Oracle (version 11 and later)
Teradata (version 14 and later)
Vertica (version 7.2.2 and later)
You can set up a migration task within minutes in the AWS Management Console.
This includes setting up connections to the source and target databases, as well as choosing the replication instance used to run the migration process.
Sources: Oracle, SQL Server, MySQL, PostgreSQL, MongoDB, SAP ASE (Sybase), and Aurora
Targets: Redshift, DynamoDB, and S3
6 TB cache limit (replication server)
IAM for security – fine-grained access control using resource names and tags
Use a KMS key to encrypt data on the replication instance
You can encrypt connections using SSL (you can use DMS to assign a certificate to the endpoint)
Change data capture sources:
Oracle supplemental logging
MySQL row-level binary logging
SQL Server bulk-logged/full recovery
PostgreSQL WAL
No agent; uses the recovery log and native change data capture APIs