Get a technical deep dive into Amazon Redshift and Redshift Spectrum. Learn best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve overall database performance. This session will explain how to migrate from existing data warehouses, create an optimized schema, efficiently load data, use workload management, and use Redshift Spectrum to query data directly in Amazon S3. The session will feature Jeff Battisti, Director, Global Cloud BI&A Medical IT at Cardinal Health, and Greg Cantwell, Senior Consultant, Business Metrics / Analytics, who will provide lessons learned and best practices from creating a new data warehouse to supporting Global Sales & Financial reporting in over 60 countries with Amazon Redshift.
2. Amazon Redshift: a lot faster, a lot simpler, a lot cheaper
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
3. The Forrester Wave™: Big Data Warehouse, Q2 2017
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
4. New! - Amazon Redshift Spectrum
Query directly against data in Amazon S3 using thousands of nodes
Fast at exabyte scale; elastic & highly available; on-demand, pay-per-query
High concurrency: multiple clusters access the same data
Query data in place using open file formats
Full Amazon Redshift SQL support
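To make the "query data in place" model concrete, here is a minimal sketch of how an external schema and table could be registered for Spectrum; the Glue database name, IAM role ARN, bucket path, and column list are illustrative assumptions, not details from this session:

-- Register an external schema backed by the AWS Glue Data Catalog
-- (database name and role ARN below are placeholders).
CREATE EXTERNAL SCHEMA s3
FROM DATA CATALOG
DATABASE 'retail_demo'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Define an external table over open-format files already sitting in S3.
CREATE EXTERNAL TABLE s3.d_customer_order_item_details (
  asin      VARCHAR(16),
  quantity  INT,
  our_price DECIMAL(10,2),
  order_day VARCHAR(10),
  region_id INT
)
STORED AS PARQUET
LOCATION 's3://my-demo-bucket/order-item-details/';

-- The data is now queryable with ordinary Redshift SQL and can be joined to local tables.
SELECT COUNT(*) FROM s3.d_customer_order_item_details;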
5. Amazon Redshift is easy to use
Provisioning in minutes
Automatic patching
SQL
Data loading
Backups are built-in
Security is built-in
Compression is built-in
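As a sketch of the built-in loading and compression path, a single COPY statement pulls files from S3 into a table in parallel; the table name, bucket path, and IAM role below are assumed for illustration, not taken from the session:

-- Bulk-load gzip-compressed CSV files from S3 in parallel.
-- COMPUPDATE ON lets Redshift choose column compression encodings automatically.
COPY products
FROM 's3://my-demo-bucket/products/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftLoadRole'
FORMAT AS CSV
GZIP
COMPUPDATE ON;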
6. Amazon Redshift is easy to use
“With Amazon Redshift and Tableau, anyone in the company can set up any queries they like - from how users are reacting to a feature, to growth by demographic or geography, to the impact sales efforts had in different areas.”
“The doors were blown wide open to create custom dashboards for anyone to instantly go in and see and assess what is going on in our ad delivery landscape, something we have never been able to do until now.”
“Provides an easy-to-use mechanism for querying data with quick and uniform response times that analysts can use to run research projects and perform in-depth analysis… We don’t have to pre-allocate resources and can easily scale up to meet demand and then scale down for efficiency.”
7. Amazon Redshift is fast
“Did I mention that it’s ridiculously fast? We’re using it to provide our analysts with an alternative to Hadoop.”
“After investigating Redshift, Snowflake, and BigQuery, we found that Redshift offers top-of-the-line performance at best-in-market price points.”
“…[Amazon Redshift] performance has blown away everyone here. We generally see 50-100X speedup over Hive.”
“We regularly process multibillion-row datasets and we do that in a matter of hours. We are heading to up to 10 times more data volumes in the next couple of years, easily.”
“We saw a 2X performance improvement on a wide variety of workloads. The more complex the queries, the higher the performance improvement.”
“On our previous big data warehouse system, it took around 45 minutes to run a query against a year of data, but that number went down to just 25 seconds using Amazon Redshift.”
8. Amazon Redshift is cheap
“450,000 online queries 98 percent faster than the previous traditional data center, while reducing infrastructure costs by 80 percent.”
“Annual costs of Redshift are equivalent to just the annual maintenance of some of the cheaper on-premises options for data warehouses…”
“Most competing data warehousing solutions would have cost us up to $1 million a year. By contrast, Amazon Redshift costs us just $100,000 all-in, representing a total cost savings of around 90%.”
9. Amazon Redshift is secure
End-to-end data encryption: encrypt S3 data using SSE and AWS KMS; encrypt all Amazon Redshift data using KMS, AWS CloudHSM, or your on-premises HSMs; enforce SSL with perfect forward secrecy using ECDHE.
Virtual private cloud: the Amazon Redshift leader node runs in your VPC; compute nodes run in a private VPC; Spectrum nodes run in a private VPC and store no state.
Alerts & notifications: communicate event-specific notifications via email, text message, or call with Amazon SNS.
Audit logging: all API calls are logged using AWS CloudTrail; all SQL statements are logged within Amazon Redshift.
Certifications & compliance: PCI DSS, FedRAMP, SOC 1/2/3, HIPAA BAA.
10. Redshift is used for mission-critical workloads
Financial and management reporting
Payments to suppliers and billing workflows
Web/mobile clickstream and event analysis
Recommendation and predictive analytics
11. Amazon Redshift has a large ecosystem
Business Intelligence | Data Integration | Systems Integrators
12. Accelerate migrations from legacy systems
“AWS Database Migration Service is the most impressive migration service we’ve seen.”
Migrate – Over 1,000 unique migrations to Amazon Redshift using AWS DMS
15. Let's build an analytic query - #1
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were the first-few-day sales of her prior books?
Let’s get the prior books she’s written.
1 Table
2 Filters
SELECT
P.ASIN,
P.TITLE
FROM
products P
WHERE
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling';
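The session abstract also covers creating an optimized schema; as a hedged sketch (the column types and key choices are assumptions for this demo, not details from the talk), the local products table could be declared so the filters and joins in these queries stay fast:

-- Illustrative local dimension table: DISTSTYLE ALL copies the small table to
-- every node so joins avoid data redistribution; the compound SORTKEY speeds
-- the author/title filters used in the queries on these slides.
CREATE TABLE products (
  asin         VARCHAR(16) NOT NULL,
  title        VARCHAR(256),
  author       VARCHAR(128),
  release_date DATE
)
DISTSTYLE ALL
SORTKEY (author, title);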
16. Let's build an analytic query - #2
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were the first-few-day sales of her prior books?
Let's compute the sales of the prior books she’s written in this series and return the top 20 values.
2 Tables (1 S3, 1 local)
2 Filters
1 Join
2 Group By columns
1 Order By
1 Limit
1 Aggregation
SELECT
P.ASIN,
P.TITLE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
products P
WHERE
D.ASIN = P.ASIN AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling'
GROUP BY P.ASIN, P.TITLE
ORDER BY SALES_sum DESC
LIMIT 20;
17. Let's build an analytic query - #3
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were the first-few-day sales of her prior books?
Let's compute the sales of the prior books she’s written in this series and return the top 20 values, just for the first three days of sales of first editions.
3 Tables (1 S3, 2 local)
5 Filters
2 Joins
3 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
18. Let's build an analytic query - #4
An author is releasing the 8th book in her popular series. How many should we order for Seattle? What were the first-few-day sales of her prior books?
Let's compute the sales of the prior books she’s written in this series and return the top 20 values, just for the first three days of sales of first editions in the city of Seattle, WA, USA.
4 Tables (1 S3, 3 local)
8 Filters
3 Joins
4 Group By columns
1 Order By
1 Limit
1 Aggregation
1 Function
2 Casts
SELECT
P.ASIN,
P.TITLE,
R.POSTAL_CODE,
P.RELEASE_DATE,
SUM(D.QUANTITY * D.OUR_PRICE) AS SALES_sum
FROM
s3.d_customer_order_item_details D,
asin_attributes A,
products P,
regions R
WHERE
D.ASIN = P.ASIN AND
P.ASIN = A.ASIN AND
D.REGION_ID = R.REGION_ID AND
A.EDITION LIKE '%FIRST%' AND
P.TITLE LIKE '%Potter%' AND
P.AUTHOR = 'J. K. Rowling' AND
R.COUNTRY_CODE = 'US' AND
R.CITY = 'Seattle' AND
R.STATE = 'WA' AND
D.ORDER_DAY :: DATE >= P.RELEASE_DATE AND
D.ORDER_DAY :: DATE < dateadd(day, 3, P.RELEASE_DATE)
GROUP BY P.ASIN, P.TITLE, R.POSTAL_CODE, P.RELEASE_DATE
ORDER BY SALES_sum DESC
LIMIT 20;
19. Now let’s run that query over an exabyte of data in S3
Roughly 140 TB of customer item order detail records for each day over the past 20 years.
190 million files across 15,000 partitions in S3. One partition per day for the USA and the rest of the world.
We need a billion-fold reduction in data processed. Running this query using a 1,000-node Hive cluster would take over 5 years.*
• Compression: 5X
• Columnar file format: 10X
• Scanning with 2,500 nodes: 2,500X
• Static partition elimination: 2X
• Dynamic partition elimination: 350X
• Redshift’s query optimizer: 40X
Total reduction: 3.5 billion X
* Estimated using a 20-node Hive cluster and 1.4 TB of data, assuming linear scaling
* Query used a 20-node DC1.8XLarge Amazon Redshift cluster
* Not actual sales data - generated for this demo based on the data format used by Amazon Retail
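The two partition-elimination factors above depend on the S3 data being laid out and registered as date partitions. As a minimal sketch, this is a partitioned variant of the external table shown earlier; the bucket layout and partition value are assumptions for illustration:

-- Declaring the partition column lets filters on order_day skip whole S3
-- prefixes up front (static elimination) or at run time, based on values
-- coming from the joined dimension tables (dynamic elimination).
CREATE EXTERNAL TABLE s3.d_customer_order_item_details (
  asin      VARCHAR(16),
  quantity  INT,
  our_price DECIMAL(10,2),
  region_id INT
)
PARTITIONED BY (order_day DATE)
STORED AS PARQUET
LOCATION 's3://my-demo-bucket/order-item-details/';

-- Each day's prefix is registered as one partition (one per day in the demo layout).
ALTER TABLE s3.d_customer_order_item_details
ADD PARTITION (order_day = '2017-01-01')
LOCATION 's3://my-demo-bucket/order-item-details/order_day=2017-01-01/';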
22. Why we do it
For those tasked with navigating the complexities of healthcare, Cardinal Health brings scaled solutions that help you thrive in a changing world. We will do it with tenacity, back it with accountability, and the understanding that we are, above all, humble partners to those of you on the front lines of healthcare. We will be your wings.
25. Our Project - Provide a scalable solution for Sales & Finance Reporting across 70 countries within 12 months!
26. Traditional vs. Cloud Approach
Traditional IT estimate vs. cloud approach:
• 2 countries per year vs. 6 countries per month
• 12 teams & 20+ people vs. 3 teams & 6 people
• 6 weeks for requirements vs. 8 hours for requirements
• 9 weeks for deployment, plus extra time for scaling, vs. minutes to deploy & scale
• Solution is outdated before complete vs. solution is constantly evolving
• Enhancements are expensive & slow vs. enhancements are low cost & fast
27. Success in the Cloud
• Saving over $1 million annually
• Reduced BI&A development time from 6 months to 6 weeks
• Automated CI/CD pipeline deployment & provisioning
• Lead time to deliver data to the business reduced from 48 hours to less than 30 minutes
• Staff onboarding time reduced from more than 30 days to 4 days
• Automated 60% of the processes, reducing overall risk by 70%
28. Lessons Learned
People:
• Success in the cloud is about people
• Everyone can do every job; silos create blockers
• Self-sufficient teams drive productivity and accountability
Process:
• Automation means you need to go slower to get faster
• DevOps is more about how we work than technology
• Agile is more about empowerment and team ownership than processes and procedures
Technology:
• Cloud native evolves and automates faster
• Cloud geography can drive bad design
• Data-out charges are FUD: ~$6 per day
29. Worldwide Reporting Design - Pilot Phase
[Architecture diagram: one-time data load from legacy system storage (SAP) and files (exchange rates, targets) via SFTP into a transactional data database, plus master data; Tableau Server and Tableau Desktop serve end users and super users]
• Minimal Viable Product with a lot of manual work
30. Our Journey
Pilot Phase
• 4 weeks to deliver 11 countries
• Co-located with end users in Switzerland
• Confirmed design/business rules on the fly
[Architecture diagram: manual upload to S3 storage, manual copy into the Amazon Redshift data warehouse, custom SQL ETL, and Tableau reporting]
31. Worldwide Reporting Design - Execution Phase
[Architecture diagram: one-time data load from legacy system storage (SAP) and files (exchange rates, targets) via SFTP into a transactional data database, plus master data, now with an ETL component; Tableau Server and Tableau Desktop serve end users and super users]
• Automating and speeding delivery…
33. Worldwide Reporting Design - Refinement Phase
[Architecture diagram: same pipeline as the execution phase, with the ETL step automated]
• Building resiliency and automation for rapid repetition