In this talk, Ian will talk about Amazon Redshift, a managed petabyte-scale data warehouse; give an overview of its integration with Amazon Elastic MapReduce, a managed Hadoop environment; and cover some exciting new developments in the analytics space.
2. AWS Database Services: Scalable High Performance Application Storage in the Cloud
• Amazon DynamoDB: Fast, Predictable, Highly Scalable NoSQL Data Store
• Amazon RDS: Managed Relational Database Service for MySQL, Oracle, and SQL Server
• Amazon ElastiCache: In-Memory Caching Service
• Amazon Redshift: Fast, Powerful, Fully Managed, Petabyte-Scale Data Warehouse Service
(Diagram: Compute, Storage, Database, and Application Services layers on the AWS Global Infrastructure, alongside Deployment & Administration and Networking.)
4. Design Objectives
Amazon Redshift: a petabyte-scale data warehouse service that was…
• A Whole Lot Simpler
• A Lot Cheaper
• A Lot Faster
5. Redshift Dramatically Reduces I/O
• Direct-attached storage
• Large data block sizes
• Columnar storage
• Data compression
• Zone maps

Id   Age  State
123  20   CA
345  25   WA
678  40   FL
(Diagram: the same table laid out as row storage vs. column storage.)
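Columnar storage and compression show up directly in Redshift's table DDL. A minimal sketch, assuming a hypothetical users table shaped like the one above (the encoding choices are illustrative, not from the talk; Redshift can also recommend encodings itself):

```sql
-- Hypothetical table: each column gets its own compression encoding,
-- which works well because columnar storage keeps like values together.
create table users (
  id    integer  encode delta,     -- near-sequential ids compress well as deltas
  age   smallint encode bytedict,  -- few distinct values: dictionary encoding
  state char(2)  encode bytedict   -- ~50 distinct values: dictionary encoding
);

-- Ask Redshift to sample existing data and recommend encodings:
analyze compression users;
```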
6. Redshift Runs on Optimized Hardware
• Optimized for I/O-intensive workloads
• HS1.8XL available on Amazon EC2
• Runs in HPC - fast network
• High disk density
HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage
HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate
(Diagram: a grid of HS1.XL nodes, 16GB RAM, 2TB disk, 2 cores each; "Click to grow … to 1.6PB".)
9. Resize your cluster while remaining online
• New target cluster provisioned in the background
• Only charged for the source cluster
(Diagram: SQL Clients/BI Tools connected to a leader node and compute nodes, 128GB RAM, 48TB disk, 16 cores each; a larger target cluster is built alongside the source.)
10. Resize your cluster while remaining online
• Fully automated: data automatically redistributed
• Read-only mode during resize
• Parallel node-to-node data copy
• Automatic DNS-based endpoint cut-over
• Only charged for one cluster
(Diagram: SQL Clients/BI Tools connected to a leader node and compute nodes, 128GB RAM, 48TB disk, 16 cores each.)
11. Amazon Redshift has security built in
• SSL to secure data in transit
• Encryption to secure data at rest
  - AES-256
  - All blocks on disks and in Amazon S3 encrypted
• No direct access to compute nodes
• Amazon VPC support
(Diagram: SQL Clients/BI Tools reach the leader node via JDBC/ODBC in the customer VPC; the leader and compute nodes, 128GB RAM, 16TB disk, 16 cores each, sit in an internal VPC on a 10 GigE (HPC) network, handling ingestion, backup, and restore against Amazon S3.)
12. Continuous Backup, Automated Recovery
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
13. (Chart: data volume over time; "data generated" grows faster than "data available for analysis", and the gap is driven by cost + effort.)
Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
14. Redshift is Priced to Analyze All Your Data
• $0.85 per hour for on-demand (2TB)
• $999 per TB per year (3-yr reservation)
17. Reporting Warehouse
• Accelerated operational reporting
• Support for short-time use cases
• Data compression, index redundancy
(Diagram: an RDBMS handling OLTP and ERP feeds Redshift, which serves Reporting and BI.)
19. Live Archive for (Structured) Big Data
• Direct integration with the COPY command
• High-velocity data
• Data ages into Redshift
• Low-cost, high-scale option for new apps
(Diagram: DynamoDB handling OLTP and web apps feeds Redshift, which serves Reporting and BI.)
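The direct integration is a COPY straight from a DynamoDB table. A minimal sketch, assuming a hypothetical Movies table on the DynamoDB side and a matching movies table in Redshift:

```sql
-- Hypothetical tables: load an aging DynamoDB table into Redshift.
-- readratio 50 caps COPY at 50% of the table's provisioned read
-- throughput so live traffic is not starved.
copy movies from 'dynamodb://Movies'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
readratio 50;
```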
20. Cloud ETL for Big Data
• Maintain online SQL access to historical logs
• Transformation and enrichment with EMR
• Longer history ensures better insight
(Diagram: logs in S3 are transformed by Elastic MapReduce and loaded into Redshift, which serves Reporting and BI.)
21. Ingestion - Best Practices
§ Goal: leverage all the compute nodes and minimize overhead
§ Best practices
  § Preferred method: COPY from S3
    § Loads data in sorted order through the compute nodes
    § Single COPY command; split data into multiple files
    § Strongly recommend that you gzip large datasets
  § If you must ingest through SQL
    § Use multi-row inserts
    § Avoid large numbers of singleton insert/update/delete operations
  § To copy from another table: CREATE TABLE AS or INSERT INTO … SELECT

insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);

copy time from 's3://mybucket/data/timerows.gz'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip delimiter '|';
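The "single COPY command, multiple files" advice works because COPY treats the S3 path as a key prefix and loads every matching object in parallel across slices. A minimal sketch with a hypothetical bucket and file names:

```sql
-- Hypothetical prefix: venue.txt.1 … venue.txt.4 all match the
-- prefix below and are loaded in parallel by a single COPY.
copy venue from 's3://mybucket/data/venue.txt'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
delimiter '|';
```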
22. Choose a Sort Key
§ Goal
  § Skip over data blocks to minimize I/O
§ Best practice
  § Sort based on range or equality predicates (WHERE clause)
  § If you access recent data frequently, sort based on TIMESTAMP
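The sort key is declared in the table DDL. A minimal sketch, assuming a hypothetical events table queried mostly by date range:

```sql
-- Hypothetical table: sorting on event_time lets zone maps skip every
-- block whose min/max range falls outside a WHERE date predicate.
create table events (
  event_id   bigint,
  event_type varchar(32),
  event_time timestamp
)
sortkey (event_time);
```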
23. Choose a Distribution Key
§ Goal
  § Distribute data evenly across nodes
  § Minimize data movement among nodes: co-located joins and co-located aggregates
§ Best practice
  § Consider using the join key as the distribution key (JOIN clause)
  § If there are multiple joins, use the foreign key of the largest dimension as the distribution key
  § Consider using the GROUP BY column as the distribution key (GROUP BY clause)
§ Avoid
  § Using a key that appears as an equality filter as your distribution key
  § For de-normalized tables with no aggregates, do not specify a distribution key; Redshift will use round robin
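Like the sort key, the distribution key is part of the DDL. A minimal sketch using hypothetical fact and dimension tables distributed on their shared join key, so matching rows land on the same node:

```sql
-- Hypothetical tables: both distributed on product_id, so the
-- sales-to-category join requires no cross-node data movement.
create table category (
  product_id  integer,
  category_id varchar(16)
)
distkey (product_id);

create table sales (
  product_id   integer,
  franchise_id integer,
  price        decimal(8,2),
  quantity     integer,
  sale_date    timestamp
)
distkey (product_id)
sortkey (sale_date);
```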
24. Example
-- Total Produce sold in Washington in January 2013
Select sum( S.Price * S.Quantity )
FROM SALES S
JOIN CATEGORY C ON C.ProductId = S.ProductId
JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
Where C.CategoryId = 'Produce' And F.State = 'WA'
AND S.Date Between '1/1/2013' AND '1/31/2013'

Dist key (S) = ProductID
Dist key (C) = ProductID
Dist key (F) = FranchiseID
Sort key (S) = Date
25. Workload Manager
§ Allows you to manage and adjust query concurrency
§ WLM allows you to
  § Increase query concurrency up to 15
  § Define user groups and query groups
  § Segregate short- and long-running queries
  § Help improve performance of individual queries
§ Be aware: the query workload is distributed to every compute node
  § Increasing concurrency may not always help, due to contention for resources (CPU, memory, and I/O)
  § Total throughput may increase by letting one query complete first and allowing other queries to wait
26. Workload Manager
§ Default: 1 queue with a concurrency of 5
§ Define up to 8 queues with a total concurrency of 15
§ Redshift has a superuser queue internally
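Once a queue is tied to a query group, a session routes its queries there with SET query_group. A minimal sketch, assuming a WLM queue configured for a hypothetical group named 'short_queries':

```sql
-- Route this session's next queries to the hypothetical
-- 'short_queries' WLM queue, then return to the default queue.
set query_group to 'short_queries';
select count(*) from sales;
reset query_group;
```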
27. Query Performance - Best Practices
§ Encode date and time using the TIMESTAMP data type instead of CHAR
§ Specify constraints
  § Redshift does not enforce constraints (primary key, foreign key, unique values), but the optimizer uses them
  § Loading and/or applications need to be aware
§ Specify a redundant predicate on the sort column

SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > '1/1/2013'
AND tab2.timestamp > '1/1/2013';

§ WLM settings
28. Summary
§ Avoid large numbers of singleton DML statements if possible
§ Use COPY for uploading large datasets
§ Choose sort and distribution keys with care
§ Encode date and time with the TIMESTAMP data type
§ Experiment with WLM settings
29. More Information
Best Practices for Designing Tables:
http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
Best Practices for Data Loading:
http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
View the Redshift Developer Guide at:
http://aws.amazon.com/documentation/redshift/