Deploying your Data Warehouse on AWS


Data warehousing is a critical component for analysing and extracting actionable insights from your data. Amazon Redshift allows you to deploy a scalable data warehouse in a matter of minutes and starts to analyse your data right away using your existing business intelligence tools.

Deploying your Data Warehouse on AWS

  1. 1. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Pratim Das 20th of September 2017 Deploying your Data Warehouse on AWS
  2. 2. Deploying your Data Warehouse on AWS: pipeline overview. Ingest (batch or streaming, from OLTP databases or your own code; questions to ask: ETL? managed? complex? cost? parallel?) with AWS Lambda, AWS Glue and/or Amazon Kinesis Firehose, or with the SCT migration agent for an existing DWH. Store in S3. Prepare for Analytics with Glue. Store again in S3, query optimized and ready for self-service. Analyze with Athena (ad-hoc analysis query service), BI & visualization tools, Redshift Spectrum, and the Redshift data warehouse with data marts and predictive workloads.
  3. 3. Redshift Architecture
  4. 4. Managed, massively parallel, petabyte-scale data warehouse. Streaming backup/restore to S3. Load data from S3, DynamoDB and EMR. Extensive security features. Scale from 160 GB -> 2 PB online. Fast, Compatible, Secure, Elastic, Simple, Cost Efficient.
  5. 5. Amazon Redshift Cluster Architecture: massively parallel, shared nothing. Leader node • SQL endpoint (JDBC/ODBC from SQL clients/BI tools) • Stores metadata • Coordinates parallel SQL processing. Compute nodes (each 128GB RAM, 16TB disk, 16 cores) • Local, columnar storage • Executes queries in parallel • Load, backup, restore • 2, 16 or 32 slices • Interconnected over 10 GigE (HPC). Ingestion, backup and restore flow to/from S3 / EMR / DynamoDB / SSH.
  6. 6. Common Redshift Use Cases
  7. 7. Use Case: Traditional Data Warehousing Business Reporting Advanced pipelines and queries Secure and Compliant Easy Migration – Point & Click using AWS Database Migration Service Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant Large Ecosystem – Variety of cloud and on-premises BI and ETL tools Japanese Mobile Phone Provider Powering 100 marketplaces in 50 countries World’s Largest Children’s Book Publisher Bulk Loads and Updates
  8. 8. Use Case: Log Analysis Log & Machine IOT Data Clickstream Events Data Time-Series Data Cheap – Analyze large volumes of data cost-effectively Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near-real time analytics Interactive data analysis and recommendation engine Ride analytics for pricing and product development Ad prediction and on-demand analytics
  9. 9. Use Case: Business Applications Multi-Tenant BI Applications Back-end services Analytics as a Service Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can focus on your business applications Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several data marts Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline Infosys Information Platform (IIP) Analytics-as-a-Service Product and Consumer Analytics
  10. 10. Redshift Customers
  11. 11. Selected Amazon Redshift customers
  12. 12. NTT Docomo: Japan’s largest mobile service provider 68 million customers Tens of TBs per day of data across a mobile network 6 PB of total data (uncompressed) Data science for marketing operations, logistics, and so on Greenplum on-premises Scaling challenges Performance issues Need same level of security Need for a hybrid environment
  13. 13. NTT Docomo: Japan's largest mobile service provider. 125 node DS2.8XL cluster: 4,500 vCPUs, 30 TB RAM, 2 PB compressed. 10x faster analytic queries. 50% reduction in time for new BI application deployment. Significantly less operations overhead. (Architecture: Data Source, ETL, AWS Direct Connect, Client Forwarder, Loader, State Management, Sandbox, Amazon Redshift, S3.)
  14. 14. Nasdaq: powering 100 marketplaces in 50 countries Orders, quotes, trade executions, market “tick” data from 7 exchanges 7 billion rows/day Analyze market share, client activity, surveillance, billing, and so on Microsoft SQL Server on-premises Expensive legacy DW ($1.16 M/yr.) Limited capacity (1 yr. of data online) Needed lower TCO Must satisfy multiple security and regulatory requirements Similar performance
  15. 15. 23 node DS2.8XL cluster 828 vCPUs, 5 TB RAM 368 TB compressed 2.7 T rows, 900 B derived 8 tables with 100 B rows 7 man-month migration ¼ the cost, 2x storage, room to grow Faster performance, very secure Nasdaq: powering 100 marketplaces in 50 countries
  16. 16. Tuning Redshift for Performance
  17. 17. Design for Queryability • Equally on each slice • Minimum amount of work • Use just enough cluster resources
  18. 18. Do an Equal Amount of Work on Each Slice
  19. 19. Choose Best Table Distribution Style. ALL: all data on every node. KEY: same key goes to the same location (slice). EVEN: round-robin distribution. (The diagram shows two nodes with two slices each for every style.)
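For reference, a minimal DDL sketch of the three distribution styles; the table and column names are hypothetical and only illustrate where DISTSTYLE, DISTKEY and SORTKEY go:

    -- KEY: co-locate rows that share the same distribution key on the same slice
    CREATE TABLE order_items (
      customer_id BIGINT,
      order_date  DATE,
      amount      DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    SORTKEY (order_date);   -- sort key lets scans skip blocks outside the date range

    -- ALL: copy a small dimension table to every node
    CREATE TABLE dim_region (region_id INT, region_name VARCHAR(64)) DISTSTYLE ALL;

    -- EVEN: round-robin rows when there is no dominant join key
    CREATE TABLE staging_events (event_id BIGINT, payload VARCHAR(256)) DISTSTYLE EVEN;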
  20. 20. Do the Minimum Amount of Work on Each Slice
  21. 21. Columnar storage + large data block sizes + data compression + zone maps + direct-attached storage. Reduced I/O = enhanced performance. Example output:
  analyze compression listing;
  Table   | Column         | Encoding
  --------+----------------+----------
  listing | listid         | delta
  listing | sellerid       | delta32k
  listing | eventid        | delta32k
  listing | dateid         | bytedict
  listing | numtickets     | bytedict
  listing | priceperticket | delta32k
  listing | totalprice     | mostly32
  listing | listtime       | raw
  (The numeric ranges on the slide illustrate zone maps: the min/max values kept per block so whole blocks can be skipped.)
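If you want to apply the suggested encodings up front instead of relying on automatic compression during COPY, you can declare them in the DDL. A sketch using the TICKIT sample listing table (column types assumed from that sample schema):

    CREATE TABLE listing (
      listid         INTEGER       ENCODE delta,
      sellerid       INTEGER       ENCODE delta32k,
      eventid        INTEGER       ENCODE delta32k,
      dateid         SMALLINT      ENCODE bytedict,
      numtickets     SMALLINT      ENCODE bytedict,
      priceperticket DECIMAL(8,2)  ENCODE delta32k,
      totalprice     DECIMAL(8,2)  ENCODE mostly32,
      listtime       TIMESTAMP     ENCODE raw
    );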
  22. 22. Use Cluster Resources Efficiently to Complete as Quickly as Possible
  23. 23. Amazon Redshift Workload Management. Clients (BI tools, SQL clients, analytics tools) submit queries that wait in queues and are routed by WLM. Example configuration: a Queries queue with 80% memory and 4 slots (80/4 = 20% per slot), and an ETL queue with 20% memory and 2 slots (20/2 = 10% per slot).
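Queries are routed to a queue by user group or query group. A minimal sketch of routing a session to a hypothetical 'etl' query group and temporarily borrowing extra slots for one large statement:

    SET query_group TO 'etl';        -- run this session's queries in the queue whose query group is 'etl'
    SET wlm_query_slot_count TO 2;   -- claim 2 slots (and their memory) for the following statements
    -- ... run the heavy ETL statement here ...
    RESET wlm_query_slot_count;
    RESET query_group;               -- return to the default queue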
  24. 24. Redshift Performance Tuning
  25. 25. Redshift Playbook Part 1: Preamble, Prerequisites, and Prioritization Part 2: Distribution Styles and Distribution Keys Part 3: Compound and Interleaved Sort Keys Part 4: Compression Encodings Part 5: Table Data Durability amzn.to/2quChdM
  26. 26. Optimizing Amazon Redshift by Using the AWS Schema Conversion Tool amzn.to/2sTYow1
  27. 27. Amazon Redshift Spectrum
  28. 28. Amazon Redshift Spectrum: run SQL queries directly against data in S3 using thousands of nodes. Fast @ exabyte scale. Elastic & highly available. On-demand, pay-per-query. High concurrency: multiple clusters access the same data. No ETL: query data in-place using open file formats. Full Amazon Redshift SQL support.
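To query S3 through Spectrum you register an external schema against a data catalog and define external tables over the files; nothing is loaded into the cluster. A sketch (the IAM role, bucket path and column types are placeholders; the table name matches the demo queries later in the deck):

    CREATE EXTERNAL SCHEMA s3
    FROM DATA CATALOG
    DATABASE 'spectrum_demo'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;

    CREATE EXTERNAL TABLE s3.d_customer_order_item_details (
      asin      VARCHAR(16),
      quantity  INT,
      our_price DECIMAL(8,2),
      order_day VARCHAR(10),
      region_id INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION 's3://my-bucket/order-item-details/';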
  29. 29. Life of a query, step 1: the query is submitted over JDBC/ODBC, e.g. SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
  30. 30. Life of a query, step 2: the query is optimized and compiled at the leader node, which determines what gets run locally and what goes to Amazon Redshift Spectrum.
  31. 31. Life of a query, step 3: the query plan is sent to all compute nodes.
  32. 32. Life of a query, step 4: compute nodes obtain partition info from the Data Catalog (or Apache Hive Metastore) and dynamically prune partitions.
  33. 33. Life of a query, step 5: each compute node issues multiple requests to the Amazon Redshift Spectrum layer.
  34. 34. Life of a query, step 6: Amazon Redshift Spectrum nodes scan your S3 data.
  35. 35. Life of a query, step 7: Amazon Redshift Spectrum projects, filters, joins and aggregates.
  36. 36. Life of a query, step 8: final aggregations and joins with local Amazon Redshift tables are done in-cluster.
  37. 37. Life of a query, step 9: the result is sent back to the client.
  (The diagram on each of these slides shows the Redshift cluster behind JDBC/ODBC, the numbered fleet of Spectrum nodes 1…N, Amazon S3 exabyte-scale object storage, and the Data Catalog / Apache Hive Metastore.)
  38. 38. Demo: Running an analytic query over an exabyte in S3
  39. 39. Let's build an analytic query - #1. An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's get the prior books she's written. 1 Table, 2 Filters: SELECT P.ASIN, P.TITLE FROM products P WHERE P.TITLE LIKE '%POTTER%' AND P.AUTHOR = 'J. K. Rowling'
  40. 40. Let's build an analytic query - #2. An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values. 2 Tables (1 S3, 1 local), 2 Filters, 1 Join, 2 Group By columns, 1 Order By, 1 Limit, 1 Aggregation: SELECT P.ASIN, P.TITLE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, products P WHERE D.ASIN = P.ASIN AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' GROUP BY P.ASIN, P.TITLE ORDER BY SALES_sum DESC LIMIT 20;
  41. 41. Let's build an analytic query - #3. An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions. 3 Tables (1 S3, 2 local), 5 Filters, 2 Joins, 3 Group By columns, 1 Order By, 1 Limit, 1 Aggregation, 1 Function, 2 Casts: SELECT P.ASIN, P.TITLE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
  42. 42. Let's build an analytic query - #4. An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were prior first-few-day sales? Let's compute the sales of the prior books she's written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA. 4 Tables (1 S3, 3 local), 8 Filters, 3 Joins, 4 Group By columns, 1 Order By, 1 Limit, 1 Aggregation, 1 Function, 2 Casts: SELECT P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE, SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum FROM s3.d_customer_order_item_details D, asin_attributes A, products P, regions R WHERE D.ASIN = P.ASIN AND P.ASIN = A.ASIN AND D.REGION_ID = R.REGION_ID AND A.EDITION LIKE '%FIRST%' AND P.TITLE LIKE '%Potter%' AND P.AUTHOR = 'J. K. Rowling' AND R.COUNTRY_CODE = 'US' AND R.CITY = 'Seattle' AND R.STATE = 'WA' AND D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE) GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE ORDER BY SALES_sum DESC LIMIT 20;
  43. 43. Now let’s run that query over an exabyte of data in S3 Roughly 140 TB of customer item order detail records for each day over past 20 years. 190 million files across 15,000 partitions in S3. One partition per day for USA and rest of world. Need a billion-fold reduction in data processed. Running this query using a 1000 node Hive cluster would take over 5 years.* • Compression ……………..….……..5X • Columnar file format……….......…10X • Scanning with 2500 nodes…....2500X • Static partition elimination…............2X • Dynamic partition elimination..….350X • Redshift’s query optimizer……......40X --------------------------------------------------- Total reduction……….…………3.5B X * Estimated using 20 node Hive cluster & 1.4TB, assume linear * Query used a 20 node DC1.8XLarge Amazon Redshift cluster * Not actual sales data - generated for this demo based on data format used by Amazon Retail.
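For reference, the individual reduction factors above multiply out to the quoted total: 5 × 10 × 2,500 × 2 × 350 × 40 = 3,500,000,000, i.e. roughly 3.5 billion.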
  44. 44. Is Amazon Redshift Spectrum useful if I don’t have an exabyte? Your data will get bigger On average, data warehousing volumes grow 10x every 5 years The average Amazon Redshift customer doubles data each year Amazon Redshift Spectrum makes data analysis simpler Access your data without ETL pipelines Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake Amazon Redshift Spectrum improves availability and concurrency Run multiple Amazon Redshift clusters against common data Isolate jobs with tight SLAs from ad hoc analysis
  45. 45. Ingestion, ETL & BI
  46. 46. Getting data to Redshift using AWS Database Migration Service (DMS) Simple to use Minimal Downtime Supports most widely used Databases Low Cost Fast & Easy to Set-up Reliable Source=OLTP
  47. 47. Getting data to Redshift using AWS Schema Conversion Tool (SCT) Source=OLAP
  48. 48. Loading data from S3 • Splitting Your Data into Multiple Files • Uploading Files to Amazon S3 • Using the COPY Command to Load from Amazon S3
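A minimal COPY sketch for loading split, gzip-compressed, pipe-delimited files in parallel from S3 (the bucket, prefix, IAM role and region below are placeholders):

    COPY listing
    FROM 's3://my-bucket/tickit/listings_pipe/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    DELIMITER '|'
    GZIP
    REGION 'eu-west-1';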
  49. 49. ETL on Redshift
  50. 50. Upsert into Amazon Redshift using AWS Glue and SneaQL http://amzn.to/2woj3gB
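Redshift (as of this deck) has no single-statement upsert, so tools like SneaQL typically script the staging-table merge pattern. A generic sketch, not SneaQL-specific, with hypothetical table, key and path names:

    BEGIN;

    CREATE TEMP TABLE stage (LIKE target_table);

    COPY stage
    FROM 's3://my-bucket/incremental/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
    FORMAT AS CSV;

    -- remove existing rows that share a key with the incoming batch, then insert the new versions
    DELETE FROM target_table
    USING stage
    WHERE target_table.id = stage.id;

    INSERT INTO target_table SELECT * FROM stage;

    DROP TABLE stage;
    COMMIT;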
  51. 51. Partner ETL solutions Data integration
  52. 52. BI and Visualization
  53. 53. QuickSight for BI on Redshift Amazon Redshift
  54. 54. Partner BI solutions
  55. 55. Advanced Analytics • Approximate functions • User defined functions • Machine learning • Data science
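Two quick sketches for the first two bullets, an approximate aggregate and a scalar Python UDF; the function name and logic are illustrative, and the queries assume the TICKIT listing sample table:

    -- approximate distinct count: trades a small error bound for much faster execution
    SELECT APPROXIMATE COUNT(DISTINCT sellerid) FROM listing;

    -- scalar Python UDF
    CREATE OR REPLACE FUNCTION f_margin (price float, cost float)
    RETURNS float
    IMMUTABLE
    AS $$
        if price is None or cost is None:
            return None
        return price - cost
    $$ LANGUAGE plpythonu;

    SELECT listid, f_margin(totalprice, priceperticket) AS margin FROM listing LIMIT 10;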
  56. 56. Redshift Partner Ecosystem
  57. 57. 4 types of partners • Load and transform your data with Data Integration Partners • Analyze data and share insights across your organization with Business Intelligence Partners • Architect and implement your analytics platform with System Integration and Consulting Partners • Query, explore and model your data using tools and utilities from Query and Data Modeling Partners aws.amazon.com/redshift/partners/
  58. 58. AWS Named as a Leader in The Forrester Wave™: Big Data Warehouse, Q2 2017 http://bit.ly/2w1TAEy On June 15, Forrester published the Big Data Warehouse, Q2 2017, in which AWS is positioned as a Leader. According to Forrester, “With more than 5,000 deployments, Amazon Redshift has the largest data warehouse deployments in the cloud.” AWS received the highest score possible, 5/5, for customer base, market awareness, ability to execute, road map, support, and partners. “AWS’s key strengths lie in its dynamic scale, automated administration, flexibility of database offerings, good security, and high availability (HA) capabilities, which make it a preferred choice for customers.”
  59. 59. Migrations
  60. 60. Extending your DWH (or Migrations) to Redshift http://amzn.to/2vN3UBO Oracle to Redshift
  61. 61. Extending your DWH (or Migrations) to Redshift http://amzn.to/2wZy7OA Teradata to Redshift
  62. 62. Extending your DWH (or Migrations) to Redshift http://amzn.to/2hbKwYd Converge Silos to Redshift
  63. 63. London Redshift Meetup http://bit.ly/2x6ZdGA Next Session on the 26th of October. RSVP opens 6th of October
  64. 64. Questions?

Editor's Notes

  • We launched Redshift Valentine's Day February 2013 so it's been out for a little over four years now and since that time we've innovated like crazy. We've been releasing patches on a regular cadence, usually every two weeks. https://aws.amazon.com/redshift/whats-new/
  • You can start small and grow without interrupting queries. It's cost efficient: the smallest environment is a single-node 160GB cluster with SSD storage for $0.25/hour | Scale up to the largest cluster (NTT Docomo) with 4PB of usable data in Redshift. It's Simple > SQL, JDBC/ODBC connectivity to your favorite BI tool.
    Fast > We optimized for scans against TB/PB data sets by leveraging columnar storage and compression within a shared-nothing, massively parallel architecture.
    Secure > Security is built-in. You can encrypt data at rest and in transit, isolate your clusters using Amazon VPC and even manage your keys using AWS Key Management Service (KMS) and hardware security modules (HSMs).
  • Let’s take a closer look at the Redshift architecture. The bit that you work with is what we call the leader node. This is where you connect using your driver; you can use JDBC/ODBC. In addition to the Redshift drivers, which you can download from our site, you can technically connect with any Postgres driver. Behind the leader node are the compute nodes; you can have up to 128 of those. These compute nodes also talk to other services, primarily S3. We usually ingest data from S3, and you can unload data to S3. We are continuously backing up your cluster to S3; all of this happens in the background and in parallel, and that's really the important takeaway on this slide. As we receive your query we generate C++ code that is compiled and sent down to all the compute nodes; that work is also done by the leader node. Postgres catalog tables also exist in the leader, and we've also added additional metadata tables to Redshift.
  • Fully Managed – We provision, backup, patch and monitor so you can focus on your data
    Fast – Massively Parallel Processing and columnar architecture for fast queries and parallel loads

    Nasdaq security – Ingests 5B rows/trading day, analyzes orders, trades and quotes to protect traders, report activity and develop their marketplaces

    NTT Docomo - Redshift is NTT Docomo's primary analytics platform for data science and marketing and logistic analytics. Data is pre-processed on premises and loaded into a massive, multi-petabyte data warehouse on Amazon Redshift, which data scientists use as their primary analytics platform.


  • Pinterest uses Redshift for interactive data analysis. Redshift is used to store all web event data and is used for KPIs, recommendations and A/B experimentation.

    Lyft uses Redshift for ride analytics across the world (rides / location data ) - Through analysis, company engineers estimated that up to 90% of rides during peak times had similar routes. This led to the introduction of Lyft Line – a service that allows customers to save up to 60% by carpooling with others who are going in the same direction.

    Yelp has multiple deployments of Redshift with different data sets in use by product management, sales analytics, ads, SeatMe (Point of sale analytics) and many other teams.
    Analyzes 0s of millions of ads/day, 250M mobile events/day, ad campaign performance and new feature usage
  • Accenture Insights Platform (AIP) is a scalable, on-demand, globally available analytics solution running on Amazon Redshift. AIP is Accenture's foundation for its big data offering to deliver analytics applications for healthcare and financial services.
  • USE CASES – to use across the next several slides (6-10)
    Amazon –Understand customer behavior, migrated from Oracle, PBs workload, 2TB/day@67% YoY. Could query across 1 week in one hour with Oracle, now can query 15 months in 14 min with Redshift
    Boingo – 2000+ Commercial Wi-Fi locations, 1 million+ Hotspots, 90M+ ad engagements in 100+ countries. Used Oracle, Rapid data growth slowed analytics, Admin overhead and Expensive (license, h/w, support). After migration 180x performance improvement and 7x cost savings
    Finra (Financial Industry Regulatory Authority) – One of the largest independent securities regulators in the United States, established to monitor and regulate financial trading practices. Reacts to changing market dynamics while providing its analysts with the tools (Redshift) to interactively query multi-petabyte data sets. Captures, analyzes, and stores ~75 billion records daily. The company estimates it will save up to $20 million annually by using AWS instead of a physical data center infrastructure.
    Desk.com – Ingests 200K case related events/hour and runs a user facing portal on Redshift
  • In Redshift world throughput is not about concurrency but effective use of MPP architecture. For the next few slides, lets see how we do that…

    Don't skew the work to a few slices – choose the right distribution key.
    Minimum - don't pull more blocks off disk than you absolutely have to – Use sort keys to eliminate IO.
    Assign just enough memory, which is the only resource you control, to the query - any more and you're wasting memory that could've been used by other queries – WLM.
  • How do we Do an Equal Amount of Work on Each Slice? Slices are a very important concept in Redshift - it's how we achieve parallelism within a node on the cluster. We've taken each compute node and divided it up into what we call slices, each with its own share of compute and storage. Depending on the node type, there are 2, 16 or 32 of these slices within each compute node.
  • This nicely leads to how do we distribute data into these slices.
    KEY (distribute on the column participating in the largest join; we run a modulo operation that figures out which slice the data will land on). Use it for your large fact tables, with billions and trillions of rows, and your largest dimension table. Consider using it on columns appearing in your GROUP BYs.
    Next is EVEN: here we'll just round-robin the data through the cluster to make it even; it's the default – if you are not sure what to do, this is the safest to start with. Avoid using it on columns you are performing aggregations on.
    ALL: This is ideal for Small (~5M rows) dimension tables
  • Now, lets talk about how do we Do the Minimum Amount of Work on Each Slice
  • Redshift uses a block size of 1 MB, which is more efficient and further reduces the number of I/O requests needed to perform any database loading or queries. Because it's columnar, data is stored column by column, all of the same data type and hence much easier to compress. And there are different compression encoding types based on the data type you have. ZSTD was added early this year; it works with all data types and gives great compression – great for CHAR, VARCHAR and JSON strings.
    Redshift also has an in-memory data structure (the zone maps) that contains the MIN/MAX of each of the blocks. That allows you to effectively prune out data when you are evaluating a predicate and skip blocks that don't have data for your query results.
    Redshift will do this for you; then use sort keys to further reduce I/O.

  • Let’s talk about Queues and how you effectively manage the resources in your cluster.
  • You can use workload management (WLM) to define multiple queues and to route queries at runtime. This is important when you have multiple sessions or users running queries at the same time, each with different profiles. You can configure up to 8 query queues and set the number of queries that can run in each of those queues concurrently. You can set rules to route queries to particular queues based on users or labels, i.e. query groups. For example, you can have a rule that says: any long-running query with high I/O skew in this queue, where segment execution time (seconds) > 120 and I/O skew (ratio) > 2, should hop to another queue, be logged, or be aborted. You can configure the amount of memory allocated to each queue, so that large queries run in queues with more memory than other queues. You can also configure the WLM timeout property to limit long-running queries.
  • Take a note on this. This is a 5 step guide to do Redshift performance tuning.
  • You can use the AWS Schema Conversion Tool (AWS SCT) to optimize Redshift. You can choose to Collect Statistics; AWS SCT uses the statistics to make suggestions for sort keys and distribution keys. You can then choose to Run Optimization and review the suggestions table by table.
    You can also create a report in PDF/CSV that contains the optimization suggestions. You can right-click and choose Apply to database to apply the recommendations.
    SCT is also your friend when you are planning to migrate another database or a data warehouse to Redshift. SCT supports: Oracle Data Warehouse, Microsoft SQL Server, Teradata, Netezza, Greenplum, Vertica… to Redshift. It will not only do an assessment but will also migrate schemas and other objects to Redshift (DDLs), not the data – for data we have DMS, which I will cover shortly.

  • Let’s take a quick look at Spectrum
  • Spectrum gives you the ability to use Redshift SQL queries directly against your S3 data using thousands of nodes. It's fast even at exabyte scale; it's elastic and highly available. You pay per query. It's highly concurrent, so you can have multiple Redshift clusters access the same data in S3 using Spectrum. There is no ETL; you can query the data in place in S3 using open data formats without conversion. And it's ANSI SQL – no differences whatsoever from Redshift.
  • This is similar to what we saw on the previous slide on Redshift architecture. The only difference is that we now have a Spectrum cluster, which is a bunch of private, VPC-managed nodes that are available to you. S3 contains the data; you also need the metadata to understand that data, whether that's the Athena catalog today, the Glue Data Catalog when it launches in a little while, or your own Hive metastore if you happen to have one. Now, let's trace the journey of a query in a Redshift cluster using Spectrum. You start by submitting your query via your client…
  • The query is optimized at the leader node and compiled there into C++, and we determine what gets run locally and what needs to go to Spectrum.
  • The code is sent to all of the compute nodes to execute.
  • The compute nodes figure out the partition information from the data catalog and they're able to dynamically prune the partitions based on the parts of the plan that they've run already.
  • And they issue multiple requests to Spectrum to grab the files.
  • Spectrum is going to scan your S3 data.
  • It runs the projections, filters and aggregates.
  • And then the final aggregations, GROUP BYs and so forth get pulled back into Redshift where we do join processing. Because you want to do join processing on smaller amounts of data.
  • And finally, results are returned to the client.
  • Let's imagine that J.K. Rowling is about to issue her eighth book in the Harry Potter series. She sells a lot of books, so I might be a product manager thinking about how many books I should order for the region I'm responsible for.
    So I'm going to pick out the books from my products table where the title contains the word Potter and the author is J.K. Rowling. So: one table, two filters.
  • Then you want to compute the sales of those books. So now suddenly you've got two tables, one of them in S3, because customer_order_item_details could be pretty large even if products is small. There are two filters, one join, a couple of group-bys, one order-by, a limit and an aggregate.
  • The next thing you want to be able to say is: I'm not interested in her total book sales, I just want the first three days, because that's my reorder point; and I don't want every edition of the book, I just want the hardcover first edition.
    So now we're up to three tables, five filters, two joins, three group-bys, one order-by, a limit, an aggregate, a function and a couple of casts. So it's starting to get complicated, right?
  • And then you would say: am I actually just interested in the city of Seattle? So I pick up a regions table and say, OK, here's the country, here's the state and here's the city, and I basically want to join against the regions table.
  • Let’s assume you are a bookstore at Amazon scale: you generate roughly one hundred forty terabytes of customer_item_order_detail every day, and we might have been saving that for the last twenty years. That’s 190 million files across fifteen thousand partitions, basically a partition per day, and then there's a partition for US vs rest of the world…

    You really need a billion-fold reduction in the amount of data that you're processing in order to make that work and have it returned in a reasonable time. We estimated that running this query using a 1000 node Hive cluster would take over 5 years.

    For this particular data set we get about 5X compression, because most of the data in it is text. There are 50 columns but we're retrieving only about five from S3, so we get a 10X reduction there; that's 50X total. In this case Spectrum decided to run 2,500 nodes, so that gives me another big advantage. But I'm still quite a bit short of the billion-fold reduction that I need, barely into the millions. If you want to scale up you can't just throw bigger hammers at things; somewhere along the way you have to be smart about what you apply. So, using Redshift, there's a 2X static partition elimination that happens because I'm only asking for data about the US, so I can remove half of it. The query came back in just under three minutes here.

    I really think she should do another book because, as you can see, the last one is the one that shows up in the top ten all the time.
  • One reason is that your data is going to get bigger. On average, data warehousing volumes grow 10X every 5 years, so they'll go up by a factor of a thousand every 15 years, and that's industry-wide. On average, a Redshift customer doubles their storage every year, just because the costs are better.
  • Now, let’s take a closer look at AWS native services that are key to data warehousing workloads.
  • Why DMS? It is simple: you can set up and get going in minutes, you can migrate databases with little downtime, it supports many sources, and it's free – you only pay for the underlying EC2 instance. Using DMS, you can move data either directly to Redshift from a number of source databases or to a data lake on S3. Once the data is in S3, you have separated storage from compute, and it can then be consumed by multiple Redshift clusters via Spectrum, or via Athena or EMR. You can have many architectural patterns, such as 1 to 1, or bringing many data silos to Redshift for its inexpensive petabyte-scale analytics capabilities and then the ability to query/join them all together in Redshift to gain further insights. From the source you can do a one-off load or CDC.
  • The COPY command leverages Redshift’s massively parallel processing (MPP) architecture to read and load data in parallel from files in an S3 bucket. You can take maximum advantage of parallel processing by splitting your data into multiple files and by setting distribution keys on your tables.
  • AWS Glue is a fully managed ETL service that makes it easy to move data between your data stores. AWS Glue is integrated with Amazon S3, Amazon RDS, and Amazon Redshift, and can connect to any JDBC-compliant data store. AWS Glue automatically crawls your data sources, identifies data formats, and then suggests schemas and transformations, so you don’t have to spend time hand-coding data flows. You can then edit these transformations, if necessary, using the tools and technologies you already know, such as Python, Spark, Git and your favorite integrated developer environment (IDE), and share them with other AWS Glue users. Glue schedules your ETL jobs and provisions and scales all the infrastructure required so your ETL jobs run quickly and efficiently at any scale. There are no servers to manage, and you pay only for resources consumed by your ETL jobs.

  • Amazon QuickSight is a fast, cloud-powered business analytics service that makes it easy to build visualizations, perform ad-hoc analysis, and quickly get business insights from your data. QuickSight natively connects to Redshift. You can also leverage QuickSight's Super-fast, Parallel, In-memory Calculation Engine (SPICE), which uses a combination of columnar storage and in-memory technologies. QuickSight now also supports Redshift Spectrum tables. So now you have a fast BI engine on top of an exabyte-scale data warehouse.
  • Python 2.7 – do you need support for other languages? Let us know
  • For Redshift we have a list of certified partners and their solutions to work with Amazon Redshift. To search all AWS technology partners, visit the Partner Solutions Finder. Or use the AWS Marketplace.
  • Essentially, there are four types of partners: Informatica, Matillion, Talend
    Looker, Tableau, Qlik
    From Accenture, Beeva, Northbay and Uturn… and you will also hear from Musoor, who is the Director of Data Science at Crimson Macaw, on how they implemented Manchester Airport's data warehouse using Redshift.
    Aginity, Aqua Data Studio, RazorSQL and Toad

