Data warehousing is a critical component for analysing your data and extracting actionable insights from it. Amazon Redshift is a fast, fully managed, and cost-effective data warehousing system that lets you deploy a scalable data warehouse in a matter of minutes and start analysing your data right away using standard SQL and your existing Business Intelligence (BI) tools. Amazon Redshift also includes Redshift Spectrum, which lets you run SQL queries directly against exabytes of unstructured data in Amazon S3. In this session, you will learn how to migrate from existing data warehouses, optimise schemas, and load data efficiently. We will also cover analytics tools that help you build visualisations, perform ad-hoc analysis, and quickly get business insights from your data.
Learning Objectives:
• Discover best practices for building a data warehouse using Amazon Redshift
• Learn to use Amazon QuickSight for Business Intelligence and AWS Glue for ETL.
3. Amazon Redshift
Petabyte scale; massively parallel
Relational data warehouse
Fully managed; zero admin
SSD & HDD platforms
As low as $1,000/TB/Year
4. Amazon Redshift Spectrum
Run SQL queries directly against data in S3 using thousands of nodes
Fast @ exabyte scale
Elastic & highly available
On-demand, pay-per-query
High concurrency: Multiple clusters access same data
No ETL: Query data in-place using open file formats
Full Amazon Redshift SQL support
S3
SQL
6. Use Case: Traditional Data Warehousing
Business Reporting
Advanced pipelines and queries
Secure and Compliant
Bulk Loads and Updates
Easy Migration – Point & Click using AWS Database Migration Service
Secure & Compliant – End-to-End Encryption. SOC 1/2/3, PCI-DSS, HIPAA and FedRAMP compliant
Large Ecosystem – Variety of cloud and on-premises BI and ETL tools
Japanese Mobile Phone Provider
Powering 100 marketplaces in 50 countries
World’s Largest Children’s Book Publisher
7. Use Case: Log Analysis
Log & Machine IoT Data
Clickstream Events Data
Time-Series Data
Cheap – Analyze large volumes of data cost-effectively
Fast – Massively Parallel Processing (MPP) and columnar architecture for fast queries and parallel loads
Near real-time – Micro-batch loading and Amazon Kinesis Firehose for near real-time analytics
Interactive data analysis and recommendation engine
Ride analytics for pricing and product development
Ad prediction and on-demand analytics
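The micro-batch loading mentioned on this slide can be illustrated with a small sketch. It assumes a size threshold and an age threshold, whichever is reached first, which is the general idea behind buffered delivery. The class name `MicroBatcher` and its parameters are hypothetical and are not part of any AWS SDK; in a real pipeline the flush callback would write a file to S3 or hand records to a delivery stream.

```python
import time
from typing import Callable, List, Optional


class MicroBatcher:
    """Buffer records and flush when a size or age threshold is reached."""

    def __init__(self, flush: Callable[[List[str]], None],
                 max_records: int = 500, max_age_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._flush = flush
        self._max_records = max_records
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: List[str] = []
        self._opened_at: Optional[float] = None

    def add(self, record: str) -> None:
        # Remember when the current batch was opened so its age can be checked.
        if not self._buffer:
            self._opened_at = self._clock()
        self._buffer.append(record)
        too_big = len(self._buffer) >= self._max_records
        too_old = self._clock() - self._opened_at >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        # Hand the batch to the callback and start a fresh buffer.
        if self._buffer:
            self._flush(self._buffer)
            self._buffer = []


# Usage: collect batches in a list instead of loading them anywhere.
batches: List[List[str]] = []
batcher = MicroBatcher(batches.append, max_records=3)
for i in range(7):
    batcher.add(f"event-{i}")
batcher.flush()  # drain the partial final batch
batch_sizes = [len(b) for b in batches]
```

Smaller thresholds reduce latency but produce more, smaller loads; larger thresholds favour load efficiency at the cost of freshness.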
8. Use Case: Business Applications
Multi-Tenant BI Applications
Back-end services
Analytics as a Service
Fully Managed – Provisioning, backups, upgrades, security, compression all come built-in so you can focus on your business applications
Ease of Chargeback – Pay as you go, add clusters as needed. A few big common clusters, several data marts
Service Oriented Architecture – Integrated with other AWS services. Easy to plug into your pipeline
Infosys Information Platform (IIP)
Analytics-as-a-Service
Product and Consumer Analytics
9. AWS Named as a Leader in The Forrester Wave™: Big Data Warehouse, Q2 2017
http://bit.ly/2w1TAEy
On June 15, Forrester published The Forrester Wave™: Big Data Warehouse, Q2 2017, in which AWS is positioned as a Leader. According to Forrester, “With more than 5,000 deployments, Amazon Redshift has the largest data warehouse deployments in the cloud.” AWS received the highest score possible, 5/5, for customer base, market awareness, ability to execute, road map, support, and partners. “AWS’s key strengths lie in its dynamic scale, automated administration, flexibility of database offerings, good security, and high availability (HA) capabilities, which make it a preferred choice for customers.”
12. NTT Docomo: Japan’s largest mobile service provider
68 million customers
Tens of TBs per day of data across a mobile network
6 PB of total data (uncompressed)
Data science for marketing operations, logistics, and so on
Greenplum on-premises
Scaling challenges
Performance issues
Need same level of security
Need for a hybrid environment
13. NTT Docomo: Japan’s largest mobile service provider
125 node DS2.8XL cluster
4,500 vCPUs, 30 TB RAM
2 PB compressed
10x faster analytic queries
50% reduction in time for new BI application deployment
Significantly less operations overhead
Diagram: Data Source, ET, AWS Direct Connect, Client, Forwarder, Loader, State Management, Sandbox, Amazon Redshift, S3
14. Nasdaq: powering 100 marketplaces in 50 countries
Orders, quotes, trade executions, market “tick” data from 7 exchanges
7 billion rows/day
Analyze market share, client activity, surveillance, billing, and so on
Microsoft SQL Server on-premises
Expensive legacy DW ($1.16 M/yr.)
Limited capacity (1 yr. of data online)
Needed lower TCO
Must satisfy multiple security and regulatory requirements
Similar performance
15. Nasdaq: powering 100 marketplaces in 50 countries
23 node DS2.8XL cluster
828 vCPUs, 5 TB RAM
368 TB compressed
2.7 T rows, 900 B derived
8 tables with 100 B rows
7 month migration
¼ the cost, 2x storage, room to grow
Faster performance, very secure
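The cluster figures quoted for the two case studies can be sanity-checked with simple per-node arithmetic. The sketch below assumes the published per-node figures for a ds2.8xlarge node (36 vCPUs, 244 GiB of memory, 16 TB of HDD storage); those values are not stated on the slides and are an assumption here, as is the helper name `cluster_totals`.

```python
# Assumed per-node figures for a ds2.8xlarge node (not stated on the slides).
VCPUS_PER_NODE = 36
RAM_GIB_PER_NODE = 244
STORAGE_TB_PER_NODE = 16


def cluster_totals(nodes: int) -> dict:
    """Scale the per-node figures up to a whole cluster."""
    return {
        "vcpus": nodes * VCPUS_PER_NODE,
        "ram_tib": nodes * RAM_GIB_PER_NODE / 1024,
        "storage_tb": nodes * STORAGE_TB_PER_NODE,
    }


ntt_docomo = cluster_totals(125)  # the 125 node cluster
nasdaq = cluster_totals(23)       # the 23 node cluster

for name, totals in (("NTT Docomo", ntt_docomo), ("Nasdaq", nasdaq)):
    print(f"{name}: {totals['vcpus']} vCPUs, "
          f"{totals['ram_tib']:.1f} TiB RAM, {totals['storage_tb']} TB storage")
```

Under these assumptions the vCPU counts (4,500 and 828) and the storage totals (2,000 TB and 368 TB) line up with the numbers on the slides, and the memory totals are close to the rounded 30 TB and 5 TB figures.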
19. When should you add Spectrum?
Your data will get bigger
• On average, data warehousing volumes grow 10x every 5 years
• The average Amazon Redshift customer doubles data each year
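To compare the two growth figures on a like-for-like basis, the short calculation below converts "10x every 5 years" into an annual rate and "doubles each year" into a 5-year factor. It is plain compound-growth arithmetic; the function names are illustrative only.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Annual multiplier that compounds to total_factor over the given years."""
    return total_factor ** (1.0 / years)


def growth_over(annual_factor: float, years: int) -> float:
    """Total multiplier after compounding annual_factor for the given years."""
    return annual_factor ** years


industry_annual = annual_growth_factor(10, 5)  # roughly 1.58x per year
redshift_five_year = growth_over(2, 5)         # doubling yearly for 5 years

print(f"10x in 5 years is about {industry_annual:.2f}x per year")
print(f"Doubling every year is {redshift_five_year}x in 5 years")
```

In other words, doubling every year compounds to 32x over five years, noticeably faster than the 10x industry average quoted above.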
Amazon Redshift Spectrum makes data analysis simpler
• Access your data without ETL pipelines
• Teams using Amazon EMR, Athena & Redshift can collaborate using the same data lake
• Late binding views enable federated queries between internal & external tables
Amazon Redshift Spectrum improves availability and concurrency
• Run multiple Amazon Redshift clusters against common data
• Isolate jobs with tight SLAs from ad hoc analysis
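A minimal sketch of what the points above look like in SQL, wrapped in Python helpers so the statements can be generated and reviewed before being run against a cluster. The schema, table, bucket, and IAM role names are placeholders invented for illustration; the statement forms used (`CREATE EXTERNAL SCHEMA ... FROM DATA CATALOG`, `CREATE EXTERNAL TABLE ... STORED AS PARQUET`, and a view created `WITH NO SCHEMA BINDING`) follow the Redshift SQL reference as understood here and should be checked against the current documentation.

```python
# Placeholder role ARN, for illustration only.
IAM_ROLE = "arn:aws:iam::123456789012:role/ExampleSpectrumRole"


def external_schema_sql(schema: str, catalog_db: str, iam_role: str) -> str:
    """Register an external schema backed by the data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def external_table_sql(schema: str, table: str, s3_location: str) -> str:
    """Describe Parquet files in S3 as a table; no data is loaded (no ETL)."""
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "    user_id BIGINT,\n"
        "    url VARCHAR(2048),\n"
        "    clicked_at TIMESTAMP\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


def late_binding_view_sql(view: str, internal: str, external: str) -> str:
    """Combine an internal and an external table behind one late-binding view."""
    return (
        f"CREATE VIEW {view} AS\n"
        f"SELECT user_id, url, clicked_at FROM {internal}\n"
        "UNION ALL\n"
        f"SELECT user_id, url, clicked_at FROM {external}\n"
        "WITH NO SCHEMA BINDING;"
    )


statements = [
    external_schema_sql("spectrum", "example_catalog_db", IAM_ROLE),
    external_table_sql("spectrum", "clicks", "s3://example-bucket/clicks/"),
    late_binding_view_sql("public.all_clicks", "public.recent_clicks", "spectrum.clicks"),
]
print("\n\n".join(statements))
```

Because the external table only points at files in S3, several clusters can register the same schema and query the same data, which is the basis of the concurrency and isolation points on this slide.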
21. Deploying your Data Warehouse on AWS
Diagram (pipeline stages): Store, Prepare for Analytics, Store, Analyze, Present
Diagram (services): Batch, Firehose, Glue, SCT Migration Agent, S3
Diagram (decision prompts): Streaming? DBs (OLTP)? Own code? Parallel? Managed? DWH?
22. Deploying your Data Warehouse on AWS
Diagram (pipeline stages): Store, Prepare for Analytics, Analyze, Present
Diagram (services): Batch, Firehose, Glue, SCT Migration Agent, S3
Diagram (decision prompts): Streaming? DBs (OLTP)? Own code? Parallel? Managed? DWH?
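23. Deploying your Data Warehouse on AWS
Diagram (pipeline stages): Store, Analyze, Present
Diagram (ingestion services): Batch, Firehose, Glue, SCT Migration Agent, S3
Diagram (ingestion prompts): Streaming? DBs (OLTP)? Own code? Parallel? Managed? DWH?
Diagram (preparation): AWS Lambda and/or Glue
Diagram (preparation prompts): ETL? Managed? Complex? Cost?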
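24. Deploying your Data Warehouse on AWS
Diagram (pipeline stages): Analyze, Present
Diagram (ingestion services): Batch, Firehose, Glue, SCT Migration Agent, S3
Diagram (ingestion prompts): Streaming? DBs (OLTP)? Own code? Parallel? Managed? DWH?
Diagram (preparation): AWS Lambda and/or Glue
Diagram (preparation prompts): ETL? Managed? Complex? Cost?
Diagram (prepared store): S3, query optimized & ready for self-service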
25. Deploying your Data Warehouse on AWS
Diagram (pipeline stages): Present
Diagram (ingestion services): Batch, Firehose, Glue, SCT Migration Agent, S3
Diagram (ingestion prompts): Streaming? DBs (OLTP)? Own code? Parallel? Managed? DWH?
Diagram (preparation): AWS Lambda and/or Glue
Diagram (preparation prompts): ETL? Managed? Complex? Cost?
Diagram (prepared store): S3, query optimized & ready for self-service
Diagram (analysis): Athena Query Service, Ad-hoc Analysis, Redshift Spectrum, Redshift Data Warehouse, DWH and Data Marts, Predictive
26. Deploying your Data Warehouse on AWS
Diagram (ingestion services): Batch, Firehose, Glue, SCT Migration Agent, S3
Diagram (ingestion prompts): Streaming? DBs (OLTP)? Own code? Parallel? Managed? DWH?
Diagram (preparation): AWS Lambda and/or Glue
Diagram (preparation prompts): ETL? Managed? Complex? Cost?
Diagram (prepared store): S3, query optimized & ready for self-service
Diagram (analysis): Athena Query Service, Ad-hoc Analysis, Redshift Spectrum, Redshift Data Warehouse, DWH and Data Marts, Predictive
Diagram (presentation): BI & Visualization
29. Easy exploration of AWS data
Securely discover and connect to AWS data
Quickly explore AWS data sources
Relational databases (Amazon RDS, Amazon RDS for Aurora, Amazon Redshift)
NoSQL databases (Amazon DynamoDB)
Amazon EMR, Amazon S3, files (CSV, Excel, TSV, XLF, CLF)
Streaming data sources (Amazon DynamoDB, Amazon Kinesis)
Easily import data from any table or file
Automatic detection of data types
30. Business User
Diagram (access): QuickSight API, QuickSight UI, Mobile Devices, Web Browsers, Partner BI products
Diagram (components): Data Prep, Metadata, Suggestions, Connectors, SPICE
Diagram (data sources): Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS, Files, Third-party
36. Getting data to Redshift using AWS Database Migration Service (DMS)
Simple to use
Minimal Downtime
Supports most widely used Databases
Low Cost
Fast & Easy to Set-up
Reliable
39. Extending your DWH (or Migrations) to Redshift
http://amzn.to/2vN3UBO
Oracle to Redshift
40. Extending your DWH (or Migrations) to Redshift
http://amzn.to/2wZy7OA
Teradata to Redshift
41. Extending your DWH (or Migrations) to Redshift
http://amzn.to/2hbKwYd
Converge Silos to Redshift
42. Redshift Playbook
Part 1: Preamble, Prerequisites, and Prioritization
Part 2: Distribution Styles and Distribution Keys
Part 3: Compound and Interleaved Sort Keys
Part 4: Compression Encodings
Part 5: Table Data Durability
amzn.to/2quChdM
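The playbook topics (distribution keys, sort keys, compression encodings) all show up in a table's DDL. The sketch below renders a `CREATE TABLE` statement that sets each of them explicitly. The table, its columns, and the chosen encodings are invented for illustration and are not recommendations from the playbook; `create_table_sql` is a hypothetical helper, not a library function.

```python
from typing import Dict, List


def create_table_sql(table: str, columns: Dict[str, str], encodings: Dict[str, str],
                     dist_key: str, sort_keys: List[str],
                     interleaved: bool = False) -> str:
    """Render Redshift DDL with explicit distribution, sort, and encoding choices."""
    column_lines = [
        # Columns without an explicit encoding fall back to 'raw' (uncompressed).
        f"    {name} {sql_type} ENCODE {encodings.get(name, 'raw')}"
        for name, sql_type in columns.items()
    ]
    sort_style = "INTERLEAVED" if interleaved else "COMPOUND"
    return (
        f"CREATE TABLE {table} (\n"
        + ",\n".join(column_lines)
        + "\n)\n"
        + f"DISTSTYLE KEY\nDISTKEY ({dist_key})\n"
        + f"{sort_style} SORTKEY ({', '.join(sort_keys)});"
    )


columns = {
    "sale_id": "BIGINT",
    "customer_id": "INTEGER",
    "sale_date": "DATE",
    "amount": "DECIMAL(12,2)",
}
encodings = {"sale_id": "zstd", "customer_id": "zstd", "amount": "zstd"}

# Compound sort key: suits queries that filter on sale_date first.
compound_ddl = create_table_sql("sales", columns, encodings,
                                dist_key="customer_id",
                                sort_keys=["sale_date", "customer_id"])

# Interleaved sort key: gives equal weight to each sort column.
interleaved_ddl = create_table_sql("sales_interleaved", columns, encodings,
                                   dist_key="customer_id",
                                   sort_keys=["sale_date", "customer_id"],
                                   interleaved=True)
print(compound_ddl)
```

Distributing on a column that is commonly used in joins keeps matching rows on the same slice, while the sort key determines how much data a range-restricted scan can skip.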
44. Paths to Cloud Data Warehousing and Analytics
Extend
• Quickly meet business demands
• More variety of data formats for analysis
Migrate (‘Lift & Shift’)
• Current warehouse not performing & need to scale
• Reduce costs (platform & maintenance)
Born in the cloud
• Agile Self-Service Analytics
• Highly Scalable
• Elastic
45. Informatica supports both ETL and ELT Patterns
ETL (1, 2, 3)
1. Bulk Source Data Ingestion
2. Multi-part load into S3 of compressed files
3. Copy S3 data into Amazon Redshift Staging
ELT (4, 5, 6)
4, 5, 6. SQL Pushdown for Amazon Redshift to Amazon Redshift Table Integrations within same cluster
Diagram: Informatica Cloud/PowerCenter, AWS S3, Redshift Staging, Redshift Intermediate, Redshift Analytics (same Redshift cluster), with flows labelled 1 to 6
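The ELT half of this slide, where transformations run as SQL inside the warehouse rather than in the integration tool, can be illustrated with a small runnable sketch. It uses Python's built-in `sqlite3` as a stand-in for a Redshift cluster so that it runs anywhere; the table names mirror the staging, intermediate, and analytics stages in the diagram, and the columns and sample rows are invented for illustration. It does not show how Informatica itself generates pushdown SQL.

```python
import sqlite3

# In-memory SQLite database standing in for a single Redshift cluster.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging (order_id INTEGER, customer TEXT, amount TEXT);
CREATE TABLE intermediate (order_id INTEGER, customer TEXT, amount REAL);
CREATE TABLE analytics (customer TEXT, total_amount REAL, order_count INTEGER);
""")

# Raw rows as they might land in staging after a bulk copy.
conn.executemany(
    "INSERT INTO staging VALUES (?, ?, ?)",
    [(1, " alice ", "10.50"), (2, "bob", "4.00"), (3, "alice", "2.25"), (4, "bob", "")],
)

# Staging -> intermediate: cleanse and cast inside the database.
conn.execute("""
INSERT INTO intermediate
SELECT order_id, TRIM(customer), CAST(amount AS REAL)
FROM staging
WHERE amount <> ''
""")

# Intermediate -> analytics: aggregate inside the database.
conn.execute("""
INSERT INTO analytics
SELECT customer, SUM(amount), COUNT(*)
FROM intermediate
GROUP BY customer
""")

rows = conn.execute(
    "SELECT customer, total_amount, order_count FROM analytics ORDER BY customer"
).fetchall()
print(rows)
```

The point of the pattern is that only SQL text travels to the database; the rows themselves never leave it between stages.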
46. Optimized Data Ingestion into Amazon Redshift
1. Source Bulk Data Loader
2. Partitions - parallel data pipelines
3. Local staging files
4. S3 Parallel Upload
5. Copy Command to Redshift
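The five steps above can be sketched as follows, using only the Python standard library. The sketch partitions rows into compressed local staging files, builds a manifest listing where each file would live in S3, and renders a `COPY` statement that loads them in parallel. The bucket, prefix, table, and IAM role are placeholders, and the S3 upload itself (step 4) is only indicated in a comment, since it needs AWS credentials and an S3 client. This is an illustration of the pattern, not Informatica's implementation.

```python
import csv
import gzip
import json
import os
import tempfile
from typing import List


def stage_partitions(rows: List[List[str]], parts: int, directory: str) -> List[str]:
    """Steps 2-3: split rows round-robin into gzip-compressed CSV staging files."""
    paths = []
    for i in range(parts):
        path = os.path.join(directory, f"part-{i:04d}.csv.gz")
        with gzip.open(path, "wt", encoding="utf-8", newline="") as fh:
            csv.writer(fh, lineterminator="\n").writerows(rows[i::parts])
        paths.append(path)
    return paths


def build_manifest(bucket: str, prefix: str, paths: List[str]) -> str:
    """List every staged file so a single COPY can load them all."""
    entries = [
        {"url": f"s3://{bucket}/{prefix}/{os.path.basename(p)}", "mandatory": True}
        for p in paths
    ]
    return json.dumps({"entries": entries}, indent=2)


def copy_command(table: str, manifest_url: str, iam_role: str) -> str:
    """Step 5: COPY statement that reads the files named in the manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "GZIP\n"
        "MANIFEST;"
    )


# Step 1 stand-in: rows pulled from a source system.
rows = [[str(i), f"user-{i}"] for i in range(10)]

with tempfile.TemporaryDirectory() as tmp:
    staged = stage_partitions(rows, parts=4, directory=tmp)
    with gzip.open(staged[0], "rt", encoding="utf-8") as fh:
        first_part_lines = fh.read().splitlines()
    # Step 4 would upload each staged file to S3 in parallel here.

manifest = build_manifest("example-bucket", "loads/example", staged)
sql = copy_command(
    "staging.users",
    "s3://example-bucket/loads/example/manifest.json",
    "arn:aws:iam::123456789012:role/ExampleCopyRole",  # placeholder role
)
print(sql)
```

Splitting the load into several files lets the cluster's slices read in parallel; a common rule of thumb is to make the number of files a multiple of the number of slices.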
48. Fox Entertainment – Migrate to Amazon Redshift
Goals: Universal Data warehouse across business units in different global regions; Scale and provide self service analytics at lower cost; Accelerate Journey to AWS Cloud
Needs: Migrate from On-premise MPP Data;
Benefits:
• Repoint 6000 PC ETL mappings from Netezza to Redshift; Able to reuse existing Informatica workflows and migrate quickly to Redshift
• Informatica SQL Pushdown (ELT) was able to transform and push millions of records every hour 24 x 7.
Diagram: Logs, Click Streams, CSV, Social Feeds, Oracle, SaaS, S3, Staging Tables, Intermediate Tables, Analysis Tables
Migration to AWS Cloud; Reuse PowerCenter mappings
53. Adaptive Biotechnologies – Born in the Cloud
Goals: Acquisitions and growth propelled the need to create a DWH; ad hoc analytics for their data scientists
Needs: Flexible and scalable DWH/ETL; data models constantly changing; easy to set up and manage; cost effective
Benefits:
• Self service made easy with Redshift and Informatica
• Informatica gracefully handled HL7 and other B2B formats and helped transport them via SFTP to our collection partners
Born in the Cloud! Build a modern Data Warehouse (Redshift) and ETL (Informatica Cloud)
Diagram: Cloud, LIMS, Bioinformatics Pipelines, Customer Portal, Files, Legacy Systems
54. Amazon Redshift Connector Capabilities
Robust, Comprehensive, High Performance, Secure, Flexible
• Error management, Notifications, & Alerts
• Auto-handle special characters
• Dynamically create targets
• Pre and Post SQL
• SQL Overrides
• S3 data retention policies
• AWS Multi-Region support
• Partitioning
• SQL Pushdown
• Optimized Lookups
• Multi-part Upload & Download
• Compression before S3 Upload
• AWS KMS Support
• IAM Roles
• Client & Server Side Encryption
• S3 VPC Endpoint
• Secure Agent on premise
• Informatica Hosted agent
• Agent on AWS
• Configurable S3 Copy Options
• Dynamic S3 Buckets
55. Informatica Products on AWS
Power Center
Informatica Cloud
Big Data Management
Enterprise Informatica Catalog
Informatica Cloud
Intelligent Data Lake
Informatica Data Quality
Enterprise Informatica Catalog
Power Center
Master Data Management
Big Data Management
Informatica Data Quality
Certified
Available
56. Learn more
Learn & Prepare
• Cloud Analytics with Informatica Cloud & Amazon Redshift
• PowerCenter on AWS
• Data Lakes on AWS
Get Started on AWS Marketplace
Deep-Dive
57. AWS and Informatica Relationship Team
Romain Roullet - AWS ISV Success Manager - EMEA
https://www.linkedin.com/in/romainroullet/
Nitin Mathur - AWS Strategy & Business Development Leader - Global
https://www.linkedin.com/in/nitmathur/
Andrew McIntyre - Informatica Strategy & Business Development Leader - Global
https://www.linkedin.com/in/andrew-mcintyre-a6799765/
Ian Paton - Informatica UK Partnerships
https://www.linkedin.com/in/ian-paton-%E2%98%81-6256837/
59. The KCOM Approach
Consulting background, with Architect, DBA & DevOps resources
MVP Design
MVP Implement
Test & optimise
Iterate throughout the project lifecycle
60. Data Management Project
• Volumes - 6 billion retail transactions and 60 million rows of customer viewing data
• Platform - Amazon Redshift Massively Parallel Processing (MPP) architecture
• ETL - ingress of 200 GB compressed data in 15 mins (2 TB uncompressed)
• Performance - data matched between two data sets in ~90 seconds (43 million matched rows)
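The figures on this slide imply some useful derived rates. The arithmetic below works them out, treating 2 TB as 2,000 GB and 1 GB as 1,000 MB (decimal units are an assumption here; the slide does not say which convention it uses).

```python
def rate_per_second(quantity: float, seconds: float) -> float:
    """Average rate for a quantity processed over a duration in seconds."""
    return quantity / seconds


compressed_gb = 200
uncompressed_gb = 2000       # 2 TB, assuming decimal units
load_seconds = 15 * 60       # 15 minutes
matched_rows = 43_000_000
match_seconds = 90           # the slide says "~90 seconds"

compression_ratio = uncompressed_gb / compressed_gb
compressed_mb_per_s = rate_per_second(compressed_gb * 1000, load_seconds)
uncompressed_gb_per_s = rate_per_second(uncompressed_gb, load_seconds)
matched_rows_per_s = rate_per_second(matched_rows, match_seconds)

print(f"Compression ratio: {compression_ratio:.0f}:1")
print(f"Ingest: {compressed_mb_per_s:.0f} MB/s compressed, "
      f"{uncompressed_gb_per_s:.2f} GB/s uncompressed equivalent")
print(f"Matching: about {matched_rows_per_s:,.0f} rows/s")
```

These are averages derived from the quoted totals, so they say nothing about peak rates or how the load was distributed across nodes.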
61. Travel Industry Ticketing
• IaC for all components to facilitate CI/CD (5 environments) & Immutable builds
• IAM based permissions for Redshift
• Bulk Load with AWS Data Pipeline
• ETL Management with Step Functions
• Aggregation transforms within Redshift
• Schemas are generated to support each report type
• Reports are generated on a daily basis & on demand
• Encryption for data at rest within the system (KMS)