(BDT310) Big Data Architectural Patterns and Best Practices on AWS | AWS re:Invent 2014

•What technologies should I use?
–Why?
–How?
•Reference architecture
•Design patterns

[Diagram: Glacier, S3, DynamoDB, RDS, EMR, Redshift, Data Pipeline, Kinesis, Cassandra, CloudSearch, Kinesis-enabled app]

Ingest → Store → Process → Visualize

[Diagram: Glacier, S3, DynamoDB, RDS, Kinesis, Spark Streaming, EMR, Data Pipeline, Storm, Kafka, Redshift, Cassandra, CloudSearch, Kinesis Connector, Kinesis-enabled app]

Database
Cloud Storage
Stream Storage

Stream Storage
Database
Cloud Storage

Amazon Kinesis or Kafka
[Diagram: streams of records numbered 1 to 4; label: Shard or Partition 1]

Amazon Kinesis or Kafka
[Diagram: streams of records numbered 1 to 4, read by two consumers]
Consumer 1: Count of Red = 4, Count of Violet = 4
Consumer 2: Count of Blue = 4, Count of Green = 4
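
The consumer counts above can be sketched in a few lines. This is only an in-memory illustration, not Kinesis or Kafka client code: the `shard` list, the `consume` helper, and the assumption that all four colors sit in one list are hypothetical. It shows two consumers reading the same records independently, each counting only the colors it cares about.

```python
from collections import Counter

# Hypothetical stream contents: (sequence_number, color) pairs,
# four records per color, mirroring the numbers 1 to 4 on the slide.
COLORS = ["red", "violet", "blue", "green"]
shard = [(seq, color) for color in COLORS for seq in range(1, 5)]


def consume(records, colors_of_interest):
    """Read every record and count only the colors this consumer cares about."""
    counts = Counter()
    for _seq, color in records:
        if color in colors_of_interest:
            counts[color] += 1
    return dict(counts)


# Each consumer scans the same records; neither removes data from the stream.
consumer_1 = consume(shard, {"red", "violet"})
consumer_2 = consume(shard, {"blue", "green"})
print(consumer_1)  # {'red': 4, 'violet': 4}
print(consumer_2)  # {'blue': 4, 'green': 4}
```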

App/Web Tier
Client Tier
Database & Storage Tier

App/Web Tier
Client Tier
Data Tier: Search, Hadoop/HDFS, Cache, Blob Store, SQL, NoSQL

Amazon RDS
Amazon DynamoDB
Amazon ElastiCache
Amazon S3
Amazon Glacier
Amazon CloudSearch
HDFS on Amazon EMR

Structured – Simple Query: NoSQL (Amazon DynamoDB), Cache (Amazon ElastiCache)
Structured – Complex Query: SQL (Amazon RDS), Search (Amazon CloudSearch)
Unstructured – No Query: Cloud Storage (Amazon S3, Amazon Glacier)
Unstructured – Custom Query: Hadoop/HDFS (Amazon Elastic MapReduce)
Axes: Data Structure Complexity, Query Structure Complexity
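
The quadrant above reads naturally as a lookup table. The sketch below encodes it as one; the key names ("structured", "simple", and so on) and the `storage_options` function are illustrative assumptions, and only the service groupings come from the slide.

```python
# Groupings copied from the quadrant; the key spelling is an assumption.
STORAGE_BY_QUADRANT = {
    ("structured", "simple"): ["Amazon DynamoDB", "Amazon ElastiCache"],
    ("structured", "complex"): ["Amazon RDS", "Amazon CloudSearch"],
    ("unstructured", "none"): ["Amazon S3", "Amazon Glacier"],
    ("unstructured", "custom"): ["HDFS on Amazon Elastic MapReduce"],
}


def storage_options(structure, query):
    """Return the storage options the quadrant lists for this combination."""
    key = (structure.lower(), query.lower())
    if key not in STORAGE_BY_QUADRANT:
        raise ValueError(f"no quadrant for {key!r}")
    return STORAGE_BY_QUADRANT[key]


print(storage_options("structured", "complex"))
# ['Amazon RDS', 'Amazon CloudSearch']
```

A real choice would also weigh request rate, latency, and cost per GB; this lookup covers only the two axes shown on this slide.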

               Hot         Warm      Cold
Volume         MB–GB       GB–TB     PB
Item size      B–KB        KB–MB     KB–TB
Latency        ms          ms, sec   min, hrs
Durability     Low–High    High      Very High
Request rate   Very High   High      Low
Cost/GB        $$–$        $–¢¢      ¢

[Chart: Amazon RDS, Amazon Glacier, Amazon CloudSearch, Amazon DynamoDB, and Amazon ElastiCache positioned by request rate (high to low), cost/GB (high to low), latency (low to high), data volume (low to high), and structure (low to high)]

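Amazon ElastiCache: average latency ms; data volume GB; item size B–KB; request rate Very High; storage cost ($/GB/month) $$; durability Low–Moderate
Amazon DynamoDB: average latency ms; data volume GB–TB (no limit); item size KB (64 KB max); request rate Very High; storage cost ¢¢; durability Very High
Amazon RDS: average latency ms, sec; data volume GB–TB (3 TB max); item size KB (~row size); request rate High; storage cost ¢¢; durability High
Amazon CloudSearch: average latency ms, sec; data volume GB–TB; item size KB (1 MB max); request rate High; storage cost $; durability High
Amazon EMR (HDFS): average latency sec, min, hrs; data volume GB–PB (~nodes); item size MB–GB; request rate Low–Very High; storage cost ¢; durability High
Amazon S3: average latency ms, sec, min (~size); data volume GB–PB (no limit); item size KB–GB (5 TB max); request rate Low–Very High (no limit); storage cost ¢; durability Very High
Amazon Glacier: average latency hrs; data volume GB–PB (no limit); item size GB (40 TB max); request rate Very Low (no limit); storage cost ¢; durability Very High
Temperature scale across the services, in the order listed: Hot Data, Warm Data, Cold Data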

Use Case: A Video Streaming Application

Use Case: A Video Streaming App – Upload
[Diagram: Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, Amazon S3]

A Video Streaming App – Discovery
[Diagram: Amazon ElastiCache, CloudFront, Amazon DynamoDB, Amazon RDS, Amazon CloudSearch, Amazon S3]

Batch Processing
•Take a large amount of cold data and ask questions
•Takes minutes or hours to get answers back
Example: generating hourly, daily, or weekly reports
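
An hourly report of this kind can be sketched as a single pass over stored events. The event tuples, the video IDs, and the `hourly_report` function below are hypothetical stand-ins for cold data read from storage; a real job would run over far larger data on a batch engine.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical cold data: (ISO timestamp, video_id) view events.
events = [
    ("2014-11-12T09:05:00", "video-a"),
    ("2014-11-12T09:40:00", "video-b"),
    ("2014-11-12T10:15:00", "video-a"),
    ("2014-11-12T10:20:00", "video-a"),
]


def hourly_report(rows):
    """Scan the whole data set once and count views per (hour, video)."""
    report = defaultdict(int)
    for timestamp, video_id in rows:
        # Truncate each timestamp to the start of its hour.
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        report[(hour.isoformat(), video_id)] += 1
    return dict(report)


report = hourly_report(events)
for (hour, video_id), views in sorted(report.items()):
    print(hour, video_id, views)
```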

Use Case: Video Recommendations
[Diagram: Amazon S3, Amazon Glacier, Amazon DynamoDB, Amazon EMR]

Use Case: Batch Analytics
[Diagram: Amazon EMR, Amazon S3, Amazon Glacier, Amazon Redshift]

Stream Processing (AKA Real Time)
•Take a small amount of hot data and ask questions
•Takes a short amount of time to get answers back
Example: 1-minute metrics
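
One way to picture 1-minute metrics is a tumbling window that emits counts each time a minute closes. The sketch below is plain Python rather than Spark Streaming, Storm, or Kinesis code; the event format and the `one_minute_metrics` generator are assumptions, and it assumes events arrive in timestamp order.

```python
from collections import Counter


def one_minute_metrics(events):
    """Yield (window_start, counts) each time a 1-minute window closes.

    Events are (epoch_seconds, metric_name) pairs in timestamp order.
    """
    current_window = None
    counts = Counter()
    for epoch_seconds, name in events:
        window = epoch_seconds - epoch_seconds % 60
        if current_window is not None and window != current_window:
            # The previous minute is complete, so emit it and start a new one.
            yield current_window, dict(counts)
            counts = Counter()
        current_window = window
        counts[name] += 1
    if current_window is not None:
        # Flush the last, possibly partial, window when the input ends.
        yield current_window, dict(counts)


events = [(0, "play"), (59, "play"), (60, "play"), (61, "pause"), (119, "play")]
results = list(one_minute_metrics(events))
print(results)  # [(0, {'play': 2}), (60, {'play': 2, 'pause': 1})]
```

Only one window is held in memory at a time, which is what keeps the answer fast on a small amount of hot data.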

https://amplab.cs.berkeley.edu/benchmark/

                 Redshift     Impala          Presto          Spark           Hive
Query Latency    Low          Low             Low             Low–Medium      Medium–High
Durability       High         High            High            High            High
Data Volume      1.6 PB max   ~Nodes          ~Nodes          ~Nodes          ~Nodes
Managed          Yes          EMR bootstrap   EMR bootstrap   EMR bootstrap   Yes (EMR)
Storage          Native       HDFS            HDFS/S3         HDFS/S3         HDFS/S3
# of BI Tools    High         Medium          High            Low             High

                        Spark Streaming       Apache Storm + Trident   Kinesis Client Library
Scale/Throughput        ~Nodes                ~Nodes                   ~Nodes
Data Volume             ~Nodes                ~Nodes                   ~Nodes
Manageability           Yes (EMR bootstrap)   Do it yourself           EC2 + Auto Scaling
Fault Tolerance         Built-in              Built-in                 KCL checkpointing
Programming languages   Java, Python, Scala   Java, Scala, Clojure     Java, Python

[Diagram: alternating Process and Store stages]

[Diagram: Amazon Kinesis, Amazon Kinesis Connectors, Amazon S3, Amazon DynamoDB]

[Diagram: Amazon Kinesis, Amazon Kinesis Connectors, Amazon S3, Amazon DynamoDB, Hive, Spark, Storm]

[Architecture diagram; labels: Devices, Logging, Apps, Amazon Kinesis / Kafka, NoSQL / Amazon DynamoDB, Amazon S3, HDFS, Spark Streaming, Storm, Native Client, Spark, Presto, Hive, Impala, Amazon Redshift, Amazon CloudSearch, BI & Visualization tools]

[Diagram: from Data to Answers, arranged by Data Temperature (hot to cold) and Query Latency (low to high); labels: Amazon Kinesis/Kafka, Amazon DynamoDB, Amazon S3, HDFS, Spark Streaming, Apache Storm, Native Client, Amazon Redshift, Spark, Impala, Presto, Hive]

[Diagram: from Data to Answers; labels: Amazon Kinesis / Kafka, Spark Streaming, Apache Storm, Hive, Native Client, Amazon DynamoDB]

[Diagram: from Data to Answers; labels: Amazon Kinesis/Kafka, Amazon S3, Amazon Redshift, Hive, Spark, Presto]

[Diagram: from Data to Answers; labels: Kinesis/Kafka, DynamoDB, S3, HDFS, Redshift, Spark, Impala, Presto]

•Big data processing stages: ingest, store, process, and visualize
•Use the right tool for the job
–Ingest: transactional data, file data, stream data
–Storage: data structure, query patterns, hot vs. cold data, etc.
–Processing: query latency
•Big data reference architecture and design patterns

