Amazon Redshift Best Practices –
Part 1
April 2013
Vidhya Srinivasan & David Pearson
Agenda

• Introduction
• Redshift cluster architecture
• Best Practices for
     Data loading
     Key selection
     Querying
     WLM
• Q&A
AWS Database Services
Scalable High Performance Application Storage in the Cloud

• Amazon Redshift: Fast, Powerful, Fully Managed, Petabyte-Scale Data Warehouse Service
• Amazon DynamoDB: Fast, Predictable, Highly-Scalable NoSQL Data Store
• Amazon RDS: Managed Relational Database Service for MySQL, Oracle and SQL Server
• Amazon ElastiCache: In-Memory Caching Service

[Diagram: AWS stack layers: Deployment & Administration; Application Services; Compute, Storage, Database; Networking; AWS Global Infrastructure]
Objectives
Design and build a petabyte-scale data warehouse service

Amazon Redshift
• A Lot Faster
• A Lot Cheaper
• A Whole Lot Simpler
Redshift Dramatically Reduces I/O

• Direct-attached storage
• Large data block sizes
• Columnar storage
• Data compression
• Zone maps

Example table (shown on the slide in row storage and in column storage):

  Id  | Age | State
  ----+-----+------
  123 | 20  | CA
  345 | 25  | WA
  678 | 40  | FL
Redshift Runs on Optimized Hardware

HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate
HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage
Scales up to 1.6PB

• Optimized for I/O intensive workloads
• HS1.8XL available on Amazon EC2
• Runs in HPC - fast network
• High disk density
[Chart: data volume. Curves: data generated, data available for analysis. Labels: Gap, cost + effort]

    Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
    IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
Redshift is Priced to Analyze All Your Data

        $0.85 per hour for on-demand (2TB)
        $999 per TB per year (3-yr reservation)
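
As a rough worked comparison of the two price points above, the sketch below converts the on-demand hourly rate into a per-TB-per-year figure. It assumes (this is not stated on the slide) that the node runs continuously for 365 days; the function name is illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: the node runs around the clock


def on_demand_cost_per_tb_year(hourly_rate: float, storage_tb: float) -> float:
    """Annual on-demand cost divided by the node's storage capacity."""
    return hourly_rate * HOURS_PER_YEAR / storage_tb


# Figures taken from the slide: $0.85 per hour for a 2TB node,
# $999 per TB per year with a 3-year reservation.
on_demand = on_demand_cost_per_tb_year(0.85, 2)
reserved = 999.0

print(f"on-demand: ${on_demand:,.0f} per TB per year")
print(f"reserved:  ${reserved:,.0f} per TB per year")
print(f"ratio:     {on_demand / reserved:.1f}x")
```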
Amazon Redshift Architecture
• Leader Node
      SQL endpoint
      Postgres based
      Stores metadata
      Communicates with client
      Compiles queries
      Coordinates query execution

• Compute Nodes
      Local, columnar storage
      Execute queries in parallel - slices
      Load, backup, restore via Amazon S3

• Everything is mirrored

[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion, Backup, Restore]
Ingestion – Best Practices
•   Goal
        With 1 leader node and n compute nodes, leverage all the compute nodes and minimize overhead

•   Best Practices
        Preferred method - COPY from S3
        Loads data in sorted order through the compute nodes
        Use a single COPY command and split the data into multiple files
        Strongly recommend that you gzip large datasets

         copy time from 's3://mybucket/data/timerows.gz'
         credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
         gzip delimiter '|';

•   If you must ingest through SQL
        Use multi-row inserts
        Avoid large numbers of singleton insert/update/delete operations

         insert into category_stage values
         (default, default, default, default),
         (20, default, 'Country', default),
         (21, 'Concerts', 'Rock', default);

•   To copy from another table
        Use CREATE TABLE AS or INSERT INTO SELECT
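
The "split data into multiple files" and "gzip large datasets" advice can be sketched as a small preprocessing step. This is a minimal illustration, not part of the original deck: the function name and the object names are hypothetical, the parts are built in memory rather than uploaded to S3, and contiguous chunks are used so that the row order inside each part is preserved.

```python
import gzip


def split_into_gzip_parts(lines, n_parts):
    """Split lines into up to n_parts contiguous, gzip-compressed chunks."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    lines = list(lines)
    size = -(-len(lines) // n_parts)  # ceiling division
    parts = []
    for start in range(0, len(lines), max(size, 1)):
        chunk = lines[start:start + size]
        text = "".join(f"{line}\n" for line in chunk)
        parts.append(gzip.compress(text.encode("utf-8")))
    return parts


# Pipe-delimited sample rows, matching the delimiter in the COPY example.
rows = [f"{i}|2013-02-08 20:{i % 60:02d}:00" for i in range(10)]
parts = split_into_gzip_parts(rows, 4)

for n, blob in enumerate(parts):
    # Hypothetical object names that share one prefix.
    print(f"timerows.part{n}.gz: {len(blob)} bytes")
```

Each part would then be uploaded under a common S3 prefix so that a single COPY can pick up all of them.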
Ingestion – Best Practices (Cont’d)
• Verifying load data files
     For US east – S3 provides eventual consistency
• Verify files are in S3
• Listing Object Keys
• Query Redshift after load. This query returns entries for loading the tables in the TICKIT database…

  select query, trim(filename), curtime, status
  from stl_load_commits
  where filename like '%tickit%' order by query;

   query |           btrim           |          curtime           | status
  -------+---------------------------+----------------------------+--------
   22475 | tickit/allusers_pipe.txt  | 2013-02-08 20:58:23.274186 |      1
   22478 | tickit/venue_pipe.txt     | 2013-02-08 20:58:25.070604 |      1
   22480 | tickit/category_pipe.txt  | 2013-02-08 20:58:27.333472 |      1
   22482 | tickit/date2008_pipe.txt  | 2013-02-08 20:58:28.608305 |      1
   22485 | tickit/allevents_pipe.txt | 2013-02-08 20:58:29.99489  |      1
   22487 | tickit/listings_pipe.txt  | 2013-02-08 20:58:37.632939 |      1
   22593 | tickit/allusers_pipe.txt  | 2013-02-08 21:04:08.400491 |      1
   22596 | tickit/venue_pipe.txt     | 2013-02-08 21:04:10.056055 |      1
   22598 | tickit/category_pipe.txt  | 2013-02-08 21:04:11.465049 |      1
   22600 | tickit/date2008_pipe.txt  | 2013-02-08 21:04:12.461502 |      1
   22603 | tickit/allevents_pipe.txt | 2013-02-08 21:04:14.785124 |      1
Ingestion – Best Practices (Cont’d)
•   Redshift does not currently support an upsert statement. Use a staging table to perform an
    upsert: join the staging table with the target, update the matching rows, then insert the rest

•   Redshift does not currently enforce primary key constraints. If you COPY the same data twice, it
    will be duplicated
•   Increase the memory available to a COPY or VACUUM by increasing wlm_query_slot_count
     set wlm_query_slot_count to 3;

•   Run the ANALYZE command whenever you’ve made a non-trivial number of changes to your
    data to ensure your table statistics are current

•   The STL_LOAD_ERRORS system table can help in troubleshooting data load issues: it records
    the errors that occurred during specific loads. Adjust the COPY MAXERROR option as needed.
•   Check the character set: UTF-8 characters up to 3 bytes long are supported

•   View the console for errors
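
The staging-table upsert described above (update, then insert) can be tried out locally. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Redshift, so the SQL dialect differs in places; the table and column names (target, staging, id, name) are hypothetical.

```python
import sqlite3

# SQLite stands in for Redshift here; only the update-then-insert
# pattern carries over, not the exact syntax.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE staging (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO target VALUES (?, ?)", [(1, "old-a"), (2, "old-b")])
cur.executemany("INSERT INTO staging VALUES (?, ?)", [(2, "new-b"), (3, "new-c")])

# Step 1: update target rows that have a matching key in staging.
cur.execute("""
    UPDATE target
    SET name = (SELECT s.name FROM staging s WHERE s.id = target.id)
    WHERE EXISTS (SELECT 1 FROM staging s WHERE s.id = target.id)
""")

# Step 2: insert staging rows whose key is not yet in target.
cur.execute("""
    INSERT INTO target (id, name)
    SELECT s.id, s.name
    FROM staging s
    LEFT JOIN target t ON t.id = s.id
    WHERE t.id IS NULL
""")
conn.commit()

result = cur.execute("SELECT id, name FROM target ORDER BY id").fetchall()
print(result)
```

Running both steps inside one transaction keeps readers from seeing the table between the update and the insert.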
Console
Choose a Sort key

• Goal
   Skip over data blocks to minimize IO


• Best Practice
   Sort based on range or equality predicate (WHERE clause)
   If you access recent data frequently, sort based on TIMESTAMP
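
To illustrate why sorting helps skip data blocks, here is a toy model of zone maps: each block records the minimum and maximum value it holds, and a range predicate only needs to read blocks whose range overlaps it. The block size, data, and function names are assumptions for illustration and do not reflect Redshift's actual block layout.

```python
import random


def build_zone_map(values, block_size):
    """Return a (min, max) pair for each consecutive block of values."""
    zone_map = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        zone_map.append((min(block), max(block)))
    return zone_map


def blocks_to_scan(zone_map, lo, hi):
    """Count blocks whose [min, max] range overlaps the predicate [lo, hi]."""
    return sum(1 for bmin, bmax in zone_map if bmax >= lo and bmin <= hi)


random.seed(0)
timestamps = list(range(10_000))   # stand-in for a sorted TIMESTAMP column
shuffled = timestamps[:]
random.shuffle(shuffled)           # the same values stored unsorted

sorted_scan = blocks_to_scan(build_zone_map(timestamps, 1000), 9_500, 9_999)
unsorted_scan = blocks_to_scan(build_zone_map(shuffled, 1000), 9_500, 9_999)

print(f"blocks read when sorted:   {sorted_scan}")
print(f"blocks read when unsorted: {unsorted_scan}")
```

In the sorted case a query on the most recent values touches only the last block, while in the unsorted case recent values are scattered and far fewer blocks can be skipped.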
Choose a Distribution Key
• Goal
    Distribute data evenly across nodes
    Minimize data movement among nodes : Co-located Joins and Co-located Aggregates

• Best Practice
    Consider using Join key as distribution key (JOIN clause)
    If multiple joins, use the foreign key of the largest dimension as distribution key
    Consider using Group By column as distribution key (GROUP BY clause)

• Avoid
    Using keys that appear in equality filters as your distribution key
        • If tables are de-normalized and there are no aggregates, do not specify a distribution key; Redshift will
          use round robin
Distribution Key – Verify Data Skew

Check the data distribution
  select slice, col, num_values, minvalue, maxvalue
  from svv_diskusage where name='users' and col =0
  order by slice, col;

  slice| col | num_values | minvalue | maxvalue
  -----+-----+------------+----------+----------
  0    | 0   | 12496      | 4        | 49987
  1    | 0   | 12498      | 1        | 49988
  2    | 0   | 12497      | 2        | 49989
  3    | 0   | 12499      | 3        | 49990
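
The effect of key choice on skew can be explored offline with a simple simulation. The sketch below assigns rows to slices by hashing the candidate distribution key; SHA-256 is only a stand-in (Redshift's internal hash function is not described in this deck), and the sample data and function names are hypothetical.

```python
import hashlib
from collections import Counter


def slice_for(key, n_slices):
    """Map a key to a slice using a stand-in hash (not Redshift's own)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_slices


def rows_per_slice(keys, n_slices):
    """Count how many rows land on each slice for the given key values."""
    counts = Counter(slice_for(k, n_slices) for k in keys)
    return [counts.get(s, 0) for s in range(n_slices)]


def skew_ratio(counts):
    """Largest slice divided by the average slice; 1.0 is perfectly even."""
    return max(counts) / (sum(counts) / len(counts))


# High-cardinality key (such as a user id) versus a low-cardinality,
# unevenly populated key (such as a state code).
user_ids = range(1, 50_001)
states = ["CA"] * 40_000 + ["WA"] * 9_000 + ["FL"] * 1_000

even = rows_per_slice(user_ids, 4)
skewed = rows_per_slice(states, 4)

print("user id key:", even, f"skew {skew_ratio(even):.2f}")
print("state key:  ", skewed, f"skew {skew_ratio(skewed):.2f}")
```

A ratio close to 1.0 corresponds to the near-equal num_values per slice in the query output above; a much larger ratio suggests that one slice will do most of the work.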
Example

-- Total Produce sold in Washington in January 2013
SELECT SUM(S.Price * S.Quantity)
FROM SALES S
JOIN CATEGORY C ON C.ProductId = S.ProductId
JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
WHERE C.CategoryId = 'Produce' AND F.State = 'WA'
AND S.Date BETWEEN '1/1/2013' AND '1/31/2013';

Dist key (C) = ProductID
Dist key (S) = ProductID
Dist key (F) = FranchiseID
Sort key (S) = Date
Query Performance – Best Practices

• Encode date and time using “TIMESTAMP” data type instead of “CHAR”
• Specify Constraints
    Redshift does not enforce constraints (primary key, foreign key, unique values) but
     the optimizer uses them
    Loading processes and/or applications need to be aware that constraints are not enforced
• Specify redundant predicate on the sort column
       SELECT * FROM tab1, tab2
       WHERE tab1.key = tab2.key
       AND tab1.timestamp > '1/1/2013'
       AND tab2.timestamp > '1/1/2013';

• WLM settings
Workload Manager
• Allows you to manage and adjust query concurrency
• WLM allows you to
      Increase query concurrency up to 15
      Define user groups and query groups
      Segregate short and long running queries
      Help improve performance of individual queries

• Be aware: query workload is distributed to every compute node
    Increasing concurrency may not always help due to resource contention
        • CPU, Memory and I/O
    Total throughput may increase by letting one query complete first and allowing
     other queries to wait
Workload Manager
• Default : 1 queue with a concurrency of 5
• Define up to 8 queues with a total concurrency of 15
• Redshift has a super user queue internally
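
The interaction between queue concurrency and wlm_query_slot_count can be sketched with simple arithmetic. This assumes that a queue's memory is divided equally among its slots, which is not spelled out on these slides; the function names are illustrative.

```python
def memory_share(queue_concurrency: int, slot_count: int = 1) -> float:
    """Fraction of a queue's memory available to a query using slot_count slots.

    Assumption: the queue's memory is split equally across its slots.
    """
    if not 1 <= slot_count <= queue_concurrency:
        raise ValueError("slot_count must be between 1 and the queue concurrency")
    return slot_count / queue_concurrency


def remaining_slots(queue_concurrency: int, slot_count: int) -> int:
    """Slots left for other queries while the multi-slot query runs."""
    return queue_concurrency - slot_count


# Default queue (concurrency 5): a normal query versus a COPY or VACUUM
# run after "set wlm_query_slot_count to 3".
print(memory_share(5))
print(memory_share(5, 3))
print(remaining_slots(5, 3))
```

The trade-off is that a query holding several slots leaves fewer slots for everything else in the same queue.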
Summary

• Avoid large number of singleton DML statements
  if possible
• Use COPY for uploading large datasets
• Choose Sort and Distribution keys with care
• Encode date and time with TIMESTAMP data type
• Experiment with WLM settings
More Information

              Best Practices for Designing Tables
http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html


                 Best Practices for Data Loading
http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html


                     View the Redshift Developer Guide at:
                 http://aws.amazon.com/documentation/redshift/
Questions?

AWS Webcast - Amazon Redshift Best Practices for Data Loading and Query Performance
