In this talk, Ian will talk about Amazon Redshift, a managed petabyte-scale data warehouse; give an overview of its integration with Amazon Elastic MapReduce, a managed Hadoop environment; and cover some exciting new developments in the analytics space.
2. AWS Database Services: Scalable High Performance Application Storage in the Cloud
• Amazon DynamoDB: Fast, Predictable, Highly Scalable NoSQL Data Store
• Amazon RDS: Managed Relational Database Service for MySQL, Oracle, and SQL Server
• Amazon ElastiCache: In-Memory Caching Service
• Amazon Redshift: Fast, Powerful, Fully Managed, Petabyte-Scale Data Warehouse Service
(Diagram: Compute, Storage, Database, and Application Services layers on the AWS Global Infrastructure, alongside Deployment & Administration and Networking.)
4. Design Objectives
Amazon Redshift: a petabyte-scale data warehouse service that was…
• A Whole Lot Simpler
• A Lot Cheaper
• A Lot Faster
5. Redshift Dramatically Reduces I/O
• Direct-attached storage
• Large data block sizes
• Columnar storage
• Data compression
• Zone maps

Id   Age  State
123  20   CA
345  25   WA
678  40   FL
(Diagram: the same table laid out as row storage vs. column storage.)
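Columnar storage and compression show up directly in Redshift's table DDL. A minimal sketch, assuming a hypothetical users table shaped like the one above (the encoding choices are illustrative, not from the talk; Redshift can also recommend encodings itself):

```sql
-- Hypothetical table: each column gets its own compression encoding,
-- which works well because columnar storage keeps like values together.
create table users (
  id    integer  encode delta,     -- near-sequential ids compress well as deltas
  age   smallint encode bytedict,  -- few distinct values: dictionary encoding
  state char(2)  encode bytedict   -- ~50 distinct values: dictionary encoding
);

-- Ask Redshift to sample existing data and recommend encodings:
analyze compression users;
```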
6. Redshift Runs on Optimized Hardware
• Optimized for I/O-intensive workloads
• HS1.8XL available on Amazon EC2
• Runs in HPC - fast network
• High disk density
HS1.XL: 16GB RAM, 2 Cores, 3 Spindles, 2TB Storage
HS1.8XL: 128GB RAM, 16 Cores, 24 Spindles, 16TB Storage, 2GB/sec scan rate
(Diagram: a grid of HS1.XL nodes, 16GB RAM, 2TB disk, 2 cores each; "Click to grow … to 1.6PB".)
9. Resize your cluster while remaining online
• New target cluster provisioned in the background
• Only charged for the source cluster
(Diagram: SQL Clients/BI Tools connected to a leader node and compute nodes, 128GB RAM, 48TB disk, 16 cores each; a larger target cluster is built alongside the source.)
10. Resize your cluster while remaining online
• Fully automated: data automatically redistributed
• Read-only mode during resize
• Parallel node-to-node data copy
• Automatic DNS-based endpoint cut-over
• Only charged for one cluster
(Diagram: SQL Clients/BI Tools connected to a leader node and compute nodes, 128GB RAM, 48TB disk, 16 cores each.)
11. Amazon Redshift has security built in
• SSL to secure data in transit
• Encryption to secure data at rest
  - AES-256
  - All blocks on disks and in Amazon S3 encrypted
• No direct access to compute nodes
• Amazon VPC support
(Diagram: SQL Clients/BI Tools reach the leader node via JDBC/ODBC in the customer VPC; the leader and compute nodes, 128GB RAM, 16TB disk, 16 cores each, sit in an internal VPC on a 10 GigE (HPC) network, handling ingestion, backup, and restore against Amazon S3.)
12. Continuous Backup, Automated Recovery
• Replication within the cluster and backup to Amazon S3 to maintain multiple copies of data at all times
• Backups to Amazon S3 are continuous, automatic, and incremental
• Continuous monitoring and automated recovery from failures of drives and nodes
• Able to restore snapshots to any Availability Zone within a region
13. (Chart: data volume over time; "data generated" grows faster than "data available for analysis", and the gap is driven by cost + effort.)
Sources: Gartner, "User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011"; IDC, "Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares"
14. Redshift is Priced to Analyze All Your Data
• $0.85 per hour for on-demand (2TB)
• $999 per TB per year (3-yr reservation)
17. Reporting Warehouse
• Accelerated operational reporting
• Support for short-time use cases
• Data compression, index redundancy
(Diagram: an RDBMS handling OLTP and ERP feeds Redshift, which serves Reporting and BI.)
19. Live Archive for (Structured) Big Data
• Direct integration with the COPY command
• High-velocity data
• Data ages into Redshift
• Low-cost, high-scale option for new apps
(Diagram: DynamoDB handling OLTP and web apps feeds Redshift, which serves Reporting and BI.)
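The direct integration is a COPY straight from a DynamoDB table. A minimal sketch, assuming a hypothetical Movies table on the DynamoDB side and a matching movies table in Redshift:

```sql
-- Hypothetical tables: load an aging DynamoDB table into Redshift.
-- readratio 50 caps COPY at 50% of the table's provisioned read
-- throughput so live traffic is not starved.
copy movies from 'dynamodb://Movies'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
readratio 50;
```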
20. Cloud ETL for Big Data
• Maintain online SQL access to historical logs
• Transformation and enrichment with EMR
• Longer history ensures better insight
(Diagram: logs in S3 are transformed by Elastic MapReduce and loaded into Redshift, which serves Reporting and BI.)
21. Ingestion - Best Practices
§ Goal: leverage all the compute nodes and minimize overhead
§ Best practices
  § Preferred method: COPY from S3
    § Loads data in sorted order through the compute nodes
    § Single COPY command; split data into multiple files
    § Strongly recommend that you gzip large datasets
  § If you must ingest through SQL
    § Use multi-row inserts
    § Avoid large numbers of singleton insert/update/delete operations
  § To copy from another table: CREATE TABLE AS or INSERT INTO … SELECT

insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);

copy time from 's3://mybucket/data/timerows.gz'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
gzip delimiter '|';
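The "single COPY command, multiple files" advice works because COPY treats the S3 path as a key prefix and loads every matching object in parallel across slices. A minimal sketch with a hypothetical bucket and file names:

```sql
-- Hypothetical prefix: venue.txt.1 … venue.txt.4 all match the
-- prefix below and are loaded in parallel by a single COPY.
copy venue from 's3://mybucket/data/venue.txt'
credentials 'aws_access_key_id=<Your-Access-Key-ID>;aws_secret_access_key=<Your-Secret-Access-Key>'
delimiter '|';
```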
22. Choose a Sort Key
§ Goal
  § Skip over data blocks to minimize I/O
§ Best practice
  § Sort based on range or equality predicates (WHERE clause)
  § If you access recent data frequently, sort based on TIMESTAMP
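The sort key is declared in the table DDL. A minimal sketch, assuming a hypothetical events table queried mostly by date range:

```sql
-- Hypothetical table: sorting on event_time lets zone maps skip every
-- block whose min/max range falls outside a WHERE date predicate.
create table events (
  event_id   bigint,
  event_type varchar(32),
  event_time timestamp
)
sortkey (event_time);
```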
23. Choose a Distribution Key
§ Goal
  § Distribute data evenly across nodes
  § Minimize data movement among nodes: co-located joins and co-located aggregates
§ Best practice
  § Consider using the join key as the distribution key (JOIN clause)
  § If there are multiple joins, use the foreign key of the largest dimension as the distribution key
  § Consider using the GROUP BY column as the distribution key (GROUP BY clause)
§ Avoid
  § Using a key that appears as an equality filter as your distribution key
  § For de-normalized tables with no aggregates, do not specify a distribution key; Redshift will use round robin
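Like the sort key, the distribution key is part of the DDL. A minimal sketch using hypothetical fact and dimension tables distributed on their shared join key, so matching rows land on the same node:

```sql
-- Hypothetical tables: both distributed on product_id, so the
-- sales-to-category join requires no cross-node data movement.
create table category (
  product_id  integer,
  category_id varchar(16)
)
distkey (product_id);

create table sales (
  product_id   integer,
  franchise_id integer,
  price        decimal(8,2),
  quantity     integer,
  sale_date    timestamp
)
distkey (product_id)
sortkey (sale_date);
```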
24. Example
-- Total Produce sold in Washington in January 2013
Select sum( S.Price * S.Quantity )
FROM SALES S
JOIN CATEGORY C ON C.ProductId = S.ProductId
JOIN FRANCHISE F ON F.FranchiseId = S.FranchiseId
Where C.CategoryId = 'Produce' And F.State = 'WA'
AND S.Date Between '1/1/2013' AND '1/31/2013'

Dist key (S) = ProductID
Dist key (C) = ProductID
Dist key (F) = FranchiseID
Sort key (S) = Date
25. Workload Manager
§ Allows you to manage and adjust query concurrency
§ WLM allows you to
  § Increase query concurrency up to 15
  § Define user groups and query groups
  § Segregate short- and long-running queries
  § Help improve performance of individual queries
§ Be aware: the query workload is distributed to every compute node
  § Increasing concurrency may not always help, due to contention for resources (CPU, memory, and I/O)
  § Total throughput may increase by letting one query complete first and allowing other queries to wait
26. Workload Manager
§ Default: 1 queue with a concurrency of 5
§ Define up to 8 queues with a total concurrency of 15
§ Redshift has a superuser queue internally
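Once a queue is tied to a query group, a session routes its queries there with SET query_group. A minimal sketch, assuming a WLM queue configured for a hypothetical group named 'short_queries':

```sql
-- Route this session's next queries to the hypothetical
-- 'short_queries' WLM queue, then return to the default queue.
set query_group to 'short_queries';
select count(*) from sales;
reset query_group;
```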
27. Query Performance - Best Practices
§ Encode date and time using the TIMESTAMP data type instead of CHAR
§ Specify constraints
  § Redshift does not enforce constraints (primary key, foreign key, unique values), but the optimizer uses them
  § Loading and/or applications need to be aware
§ Specify a redundant predicate on the sort column

SELECT * FROM tab1, tab2
WHERE tab1.key = tab2.key
AND tab1.timestamp > '1/1/2013'
AND tab2.timestamp > '1/1/2013';

§ WLM settings
28. Summary
§ Avoid large numbers of singleton DML statements if possible
§ Use COPY for uploading large datasets
§ Choose sort and distribution keys with care
§ Encode date and time with the TIMESTAMP data type
§ Experiment with WLM settings
29. More Information
Best Practices for Designing Tables:
http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
Best Practices for Data Loading:
http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
View the Redshift Developer Guide at:
http://aws.amazon.com/documentation/redshift/