Getting Maximum Performance from Amazon Redshift: Complex Queries


timonk

Slides from Timon Karnezos' talk at AWS re:Invent 2013, session DAT305.


Getting Maximum Performance from Amazon
Redshift: Complex Queries
Timon Karnezos, Aggregate Knowledge
November 13, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Meet the new boss
Multi-touch Attribution

Same as the old boss
Behavioral Analytics

Same as the old boss
Market Basket Analysis

We know how to do this,
in SQL*!
* SQL:2003

Here it is.
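SELECT record_date, user_id, action,
       site, revenue,
       -- explicit frame: Redshift requires one when an aggregate window has ORDER BY
       SUM(1) OVER
         (PARTITION BY user_id
          ORDER BY record_date ASC
          ROWS UNBOUNDED PRECEDING)
         AS position
FROM user_activities;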

So why is MTA hard?

“Web Scale”
Queries
• 30 queries
• 1700 lines of SQL
• 20+ logical phases
• GBs of output

Data
• ~10^9 daily impressions
• ~10^7 daily conversions
• ~10^4 daily sites
• x 90 days

per report.

So, how do we deliver
complex reports
over
“web scale” data?
(Pssst. The answer’s Redshift. Thanks AWS.)

Write (good) queries.
Organize the data.
Optimize for the humans.

Write (good) queries.
Remember: SQL is code.

Software engineering rigor
applies to SQL.
Factored.
Concise.
Tested.

Common Table Expression

Window functions

-- Position in timeline
SUM(1) OVER (PARTITION BY user_id
             ORDER BY record_date DESC
             ROWS UNBOUNDED PRECEDING)
-- Event count in timeline
SUM(1) OVER (PARTITION BY user_id
             ORDER BY record_date DESC
             ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
-- Transition matrix of sites
LAG(site_name) OVER (PARTITION BY user_id
                     ORDER BY record_date DESC)
-- Unique sites in timeline, up to now
COUNT(DISTINCT site_name) OVER (PARTITION BY user_id
                                ORDER BY record_date DESC
                                ROWS UNBOUNDED PRECEDING)

Window functions
Scalable, combinable.
Compact but expressive.
Simple to reason about.

Organize the data.

Leverage Redshift’s MPP roots.
Fast, columnar scans, IO.
Fast sort and load.
Effective when work is distributable.

Leverage Redshift’s MPP roots.
Sort into multiple representations.
Materialize shared views.
Hash-partition by user_id.

Optimize for the humans.

Operations should not be the bottleneck.
Develop without fear.
Trade time for money.
Scale with impunity.

Operations should not be the bottleneck.
Fast S3 = scratch space for cheap
Linear query scaling = GTM quicker
Dashboard Ops = dev/QA envs, marts, clusters with just a click

But, be frugal.

Quantify and control costs
Test across different hardware, clusters.
Shut down clusters often.
Buy productivity, not bragging rights.

Thank you!
References
http://bit.ly/rs_ak
http://www.adweek.com/news/technology/study-facebook-leads-24-sales-boost-146716
http://en.wikipedia.org/wiki/Behavioral_analytics
http://en.wikipedia.org/wiki/Market_basket_analysis

Please give us your feedback on this
presentation

DAT305
As a thank you, we will select prize
winners daily for completed surveys!



Editor's Notes

I’m here to talk to you about how we write, run, and manage complex queries in Redshift at Aggregate Knowledge. To give you some background, our use of Redshift has been a substantial departure from how we generated reports in the past. Up until recently, all of our reports were generated through a streaming aggregation system.
But generating one report in particular stumped us in a streaming setting: multi-touch attribution. What is it? It’s a way of answering the most important question for an advertiser: how much should I pay for an ad? How do you answer the question “how much is a Facebook ad worth?” This is non-trivial. The way you answer it is by looking at each user’s history up to a purchase and collecting statistics about the different ads the user has seen, and what order they’ve seen them in.
And those of you who feel like this is ringing a bell, you’re right, this is just the same behavioral analytics problem you see in any business. Take a user’s activity, look at it as a timeline, and try to understand what is preventing people from buying.
Market basket is a similar report in spirit.
And really they’re all addressing the same fundamental questions: Do we care that there is more of one type of activity, or that the activity happened closer to the purchase? Do we care about the distribution of the different types of activity in time? In distinct type? Do we care that the activity was tightly clustered or evenly spread out?
And honestly, this is a pretty well understood problem in modern variants of SQL.
Here it is. You use a window function over the user’s ID and you’re done. Right?
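To make the "Here it is" query concrete, here is a minimal sketch that runs it end to end. It assumes SQLite (via Python's standard `sqlite3` module) as a local stand-in for Redshift, and the sample rows are hypothetical; only the table and column names come from the slide. An explicit `ROWS UNBOUNDED PRECEDING` frame is added so the running count is unambiguous.

```python
import sqlite3

# Hypothetical sample rows: (record_date, user_id, action, site, revenue).
ROWS = [
    ("2013-11-01", 1, "impression", "site_a", 0.0),
    ("2013-11-02", 1, "impression", "site_b", 0.0),
    ("2013-11-03", 1, "conversion", "site_b", 25.0),
    ("2013-11-01", 2, "impression", "site_a", 0.0),
    ("2013-11-04", 2, "impression", "site_c", 0.0),
]

# The slide's query, with an explicit frame and a final ORDER BY for stable output.
POSITION_QUERY = """
SELECT record_date, user_id, action, site, revenue,
       SUM(1) OVER (PARTITION BY user_id
                    ORDER BY record_date ASC
                    ROWS UNBOUNDED PRECEDING) AS position
FROM user_activities
ORDER BY user_id, record_date;
"""


def timeline_positions(rows):
    """Load rows into an in-memory table and number each user's events in time order."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_activities ("
        "record_date TEXT, user_id INTEGER, action TEXT, site TEXT, revenue REAL)"
    )
    conn.executemany("INSERT INTO user_activities VALUES (?, ?, ?, ?, ?)", rows)
    result = conn.execute(POSITION_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in timeline_positions(ROWS):
        print(row)
```

Each user's events are numbered independently, so the conversion for user 1 lands at position 3 while user 2's timeline restarts at 1.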
So I know what the problem is and I know how to write the SQL to answer it, so why am I here?
Well, lots of data makes for lots of problems. And to boot, the devil is in the details when you’re turning a query snippet into a product.
So how do we tame these crazy queries and data volumes? The answer is Redshift. We use Redshift to tackle these problems and we’re quite happy with how it’s treated us. So for the rest of the talk, I’m going to tell you about how we did that and what we’ve learned about these kinds of workloads.
What we found was that these three things were the critical parts of making Redshift work best for us.
The first thing we had to do was turn the idea of a report like MTA into something executable, and that meant writing some very complex queries. And what weighed heavily on us was that all the analytics SQL we’d seen in the past was a tangled mess, hundreds of lines long, with no documentation or testing. So we reminded ourselves that SQL is code.
And we need to apply our standard engineering rigor to our SQL. Redshift has a couple of features that make it very easy to deliver clean, logical, tested code.
The first is common table expressions. These allow me to factor my queries into digestible chunks that can be composed logically.
And to give you an idea of how helpful common table expressions can be, I have one of our MTA queries here. It’s 500 lines long. With common table expressions, I can break this complex logic into smaller, reusable, testable chunks of about 8 to 12 lines. I can then take those pieces individually and debug them and test them, and once I’m confident in the correctness of each piece, I can layer them together to form the complex business logic I need. And this is simply applying that engineering rigor I mentioned before to a place that was once like the Wild West.
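The factoring described above can be sketched roughly as follows. This is not the actual MTA query: the CTE names (`timeline`, `conversions`), the report itself, and the sample rows are hypothetical, and SQLite stands in for Redshift. The point is only that each chunk is short enough to query and check on its own before being layered into the final report.

```python
import sqlite3

# Chunk 1: number each user's events in time order.
TIMELINE_CTE = """
timeline AS (
    SELECT user_id, record_date, action, site,
           SUM(1) OVER (PARTITION BY user_id
                        ORDER BY record_date ASC
                        ROWS UNBOUNDED PRECEDING) AS position
    FROM user_activities
)"""

# Chunk 2: position of each user's first conversion (builds on chunk 1).
CONVERSIONS_CTE = """
conversions AS (
    SELECT user_id, MIN(position) AS first_conversion_position
    FROM timeline
    WHERE action = 'conversion'
    GROUP BY user_id
)"""

# Final report: how many events preceded each user's first conversion.
REPORT_QUERY = f"""
WITH {TIMELINE_CTE},
{CONVERSIONS_CTE}
SELECT t.user_id, COUNT(*) AS touches_before_conversion
FROM timeline AS t
JOIN conversions AS c ON c.user_id = t.user_id
WHERE t.position < c.first_conversion_position
GROUP BY t.user_id
ORDER BY t.user_id;
"""

# Hypothetical sample rows: (record_date, user_id, action, site).
ROWS = [
    ("2013-11-01", 1, "impression", "site_a"),
    ("2013-11-02", 1, "impression", "site_b"),
    ("2013-11-03", 1, "conversion", "site_b"),
    ("2013-11-01", 2, "impression", "site_a"),
    ("2013-11-02", 2, "impression", "site_c"),
    ("2013-11-01", 3, "conversion", "site_a"),
    ("2013-11-02", 3, "impression", "site_a"),
]


def make_db(rows):
    """Create an in-memory table holding the sample activity rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_activities ("
        "record_date TEXT, user_id INTEGER, action TEXT, site TEXT)"
    )
    conn.executemany("INSERT INTO user_activities VALUES (?, ?, ?, ?)", rows)
    return conn


def run(conn, sql):
    """Execute a query and return all rows."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = make_db(ROWS)
    # Each chunk can be inspected in isolation before it is composed.
    print(run(conn, f"WITH {TIMELINE_CTE} SELECT * FROM timeline ORDER BY user_id, position"))
    print(run(conn, REPORT_QUERY))
```

Keeping each CTE in its own named string is one way to let a test harness select from a single chunk, which is the "test the pieces, then layer them" workflow in miniature.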
The second feature that allows us to manage the complexity of these queries is window functions. They allow us to express very advanced analytics in a concise and clear way.
In MTA we use window functions to answer questions like:
• where in a user’s timeline a given ad impression falls
• how many ads does a user see per session
• how do users go from one site to another
And even more subtle analyses like “how many different sites did a user see before this site?”
And all of it in one or two lines apiece. This really makes our code much clearer and shorter.
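As a rough illustration of those one-or-two-line analyses, the sketch below computes timeline position, event count, previous site, and distinct sites seen so far in a single pass. It assumes SQLite as a stand-in for Redshift and uses hypothetical sample rows. Two choices are this sketch's own: timelines are ordered ascending so that "so far" reads forward in time, and because SQLite rejects `DISTINCT` inside a window aggregate, the distinct-site count is swapped for a running sum over a first-visit flag.

```python
import sqlite3

FEATURES_QUERY = """
WITH flagged AS (
    -- Mark the first time each user sees each site.
    SELECT user_id, record_date, site_name,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY user_id, site_name
                                        ORDER BY record_date ASC) = 1
                THEN 1 ELSE 0 END AS is_first_visit
    FROM user_activities
)
SELECT user_id, record_date, site_name,
       -- Position in timeline
       SUM(1) OVER (PARTITION BY user_id ORDER BY record_date ASC
                    ROWS UNBOUNDED PRECEDING) AS position,
       -- Event count in timeline
       SUM(1) OVER (PARTITION BY user_id ORDER BY record_date ASC
                    ROWS BETWEEN UNBOUNDED PRECEDING
                             AND UNBOUNDED FOLLOWING) AS event_count,
       -- Site-to-site transition: the site seen just before this one
       LAG(site_name) OVER (PARTITION BY user_id
                            ORDER BY record_date ASC) AS previous_site,
       -- Unique sites in timeline, up to now (running sum of first visits)
       SUM(is_first_visit) OVER (PARTITION BY user_id ORDER BY record_date ASC
                                 ROWS UNBOUNDED PRECEDING) AS unique_sites_so_far
FROM flagged
ORDER BY user_id, record_date;
"""

# Hypothetical sample rows: (user_id, record_date, site_name).
ROWS = [
    (1, "2013-11-01", "site_a"),
    (1, "2013-11-02", "site_b"),
    (1, "2013-11-03", "site_a"),
    (2, "2013-11-01", "site_c"),
]


def timeline_features(rows):
    """Compute per-event timeline features for every user."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_activities ("
        "user_id INTEGER, record_date TEXT, site_name TEXT)"
    )
    conn.executemany("INSERT INTO user_activities VALUES (?, ?, ?)", rows)
    result = conn.execute(FEATURES_QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in timeline_features(ROWS):
        print(row)
```

Grouping the resulting `(previous_site, site_name)` pairs and counting them would give the transition matrix mentioned on the slide.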
So now we have the tools and SQL to express a complex report like MTA, but because of the scale of our data, we need to do some work to make it go as fast as possible. We found a few important techniques for organizing our data that allowed us to extract great performance from Redshift.
And simply, they are just playing to Redshift’s strengths, which are those of any MPP system. MTA has to join many billion-row tables together, over multiple logical phases, which means that we need to make sure the data is laid out to facilitate that.
We have a couple of tricks that we use to do that, but the most useful one has been storing multiple representations of our data, each one sorted to optimize performance of a certain class of query. This has very little cost in Redshift because Redshift nodes come with a lot of storage, and they have excellent IO. The cost of loading data and materializing different sort orders is minimal compared to the performance improvements we see.
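The idea of keeping several sorted representations can be illustrated with a small in-memory analogy. This is not how Redshift stores data internally; the `SortedCopy` class and the sample rows are hypothetical. It only shows why a copy sorted on the column a query filters by lets that query touch a contiguous slice instead of scanning everything.

```python
from bisect import bisect_left, bisect_right


class SortedCopy:
    """One representation of the rows, kept sorted by a single column."""

    def __init__(self, rows, column):
        self.column = column
        self.rows = sorted(rows, key=lambda r: r[column])
        # Parallel list of sort keys so bisect can locate range boundaries.
        self.keys = [r[column] for r in self.rows]

    def range_scan(self, low, high):
        """Return rows with low <= row[column] <= high from one contiguous slice."""
        start = bisect_left(self.keys, low)
        stop = bisect_right(self.keys, high)
        return self.rows[start:stop]


# Hypothetical sample rows.
ROWS = [
    {"user_id": 2, "record_date": "2013-11-03", "site": "site_b"},
    {"user_id": 1, "record_date": "2013-11-01", "site": "site_a"},
    {"user_id": 3, "record_date": "2013-11-02", "site": "site_a"},
    {"user_id": 2, "record_date": "2013-11-01", "site": "site_c"},
]

# Same data, two sort orders: one per class of query.
by_user = SortedCopy(ROWS, "user_id")
by_date = SortedCopy(ROWS, "record_date")

if __name__ == "__main__":
    # Per-user timeline queries read the user-sorted copy.
    print(by_user.range_scan(2, 2))
    # Date-window queries read the date-sorted copy.
    print(by_date.range_scan("2013-11-02", "2013-11-03"))
```

The trade is the one described in the notes: extra storage and load time for each copy, in exchange for cheaper scans on every query that matches its sort order.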
Now, clean queries that run fast are one thing, but building a product is another. In order to provide business value on top of the technical value, we need to optimize for the humans that write, debug, and ship these reports.
Redshift and the tooling around it gives us the ability to remove significant operational roadblocks that we would encounter with on-premises database vendors.
For instance, the complexity of MTA really showed us how important it is to have a very flexible operational environment. Specifically, the excellent bandwidth to and from S3 gave us the ability to materialize intermediate results from our queries and debug them without worrying about scratch space. We could also quickly and easily take snapshots before any major changes, which really freed us up to take risks and experiment with different versions of queries and data layouts. In general, having great operational tools like the Redshift dashboard, plus the elastic scaling of well-integrated AWS services, allowed us to focus on developer productivity and getting a product out as fast as possible.
But just because you have all these easy tools to grow your use of Redshift doesn’t mean that you can’t be frugal.
We used the ability to launch multiple clusters at once to benchmark and quantify the cost in dollars and seconds of our reports. This makes P&L and capacity planning much, much easier. Combine that with the fact that I can turn clusters on and off with minimal effort, and suddenly the lifecycles of all those clusters I launched are now manageable and can be tailored to their use case. QA clusters only need to be up during business hours. The dashboards are easy enough to manage that the devs that launch clusters are perfectly capable of shutting them down. And just a final point to wrap it up: Redshift’s stability, scalability, and reliability make it so easy to quantify query performance in money and time that it gets you out of the habit of participating in the rat race of how many rows or how many nodes or how many jobs, and refocuses your thinking on:
• “how much does it cost to deliver this report?”
• “how long will the customer have to wait?”
• “how do I enable my developers?”
• “how do I focus on minimizing technical debt?”
It helps you focus on the business value you’re here to create.
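Quantifying a report "in dollars and seconds" comes down to simple arithmetic: nodes × hourly price per node × runtime in hours. The sketch below shows one way to compare benchmark runs under that formula. Every number in it (node counts, prices, runtimes) is hypothetical, as are the `BenchmarkRun` and `cheapest_within` names; real figures would come from your own benchmarks and current AWS pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    """One report run on one cluster configuration."""
    label: str
    node_count: int
    price_per_node_hour: float  # hypothetical USD figure
    runtime_seconds: float

    @property
    def cost_dollars(self) -> float:
        # nodes * price per node-hour * hours the report kept the cluster busy
        return self.node_count * self.price_per_node_hour * self.runtime_seconds / 3600.0


def cheapest_within(runs, max_seconds):
    """Return the cheapest run that finishes within max_seconds, or None."""
    eligible = [r for r in runs if r.runtime_seconds <= max_seconds]
    return min(eligible, key=lambda r: r.cost_dollars) if eligible else None


# Hypothetical benchmark results for the same report on three cluster sizes.
RUNS = [
    BenchmarkRun("small", node_count=2, price_per_node_hour=1.00, runtime_seconds=7200),
    BenchmarkRun("medium", node_count=4, price_per_node_hour=1.00, runtime_seconds=3000),
    BenchmarkRun("large", node_count=8, price_per_node_hour=1.00, runtime_seconds=1800),
]

if __name__ == "__main__":
    for run in RUNS:
        print(f"{run.label}: {run.runtime_seconds:.0f} s, ${run.cost_dollars:.2f}")
    choice = cheapest_within(RUNS, max_seconds=3600)
    print("cheapest within one hour:", choice.label if choice else "none")
```

Framing the choice as "cheapest run that meets the customer's wait time" keeps the comparison on cost and latency rather than on node or row counts.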
