The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many technologies for solving big data problems. But what services should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide a reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
24. Hot, warm, and cold data:
Characteristic | Hot | Warm | Cold
Volume | MB–GB | GB–TB | PB
Item size | B–KB | KB–MB | KB–TB
Latency | ms | ms, sec | min, hrs
Durability | Low–High | High | Very High
Request rate | Very High | High | Low
Cost/GB | $$–$ | $–¢¢ | ¢
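The hot/warm/cold split above can be sketched as a small classifier. This is only an illustration: the slide gives qualitative ranges, so the numeric thresholds and the names `AccessProfile` and `classify_temperature` are assumptions, not part of the session.

```python
from dataclasses import dataclass


@dataclass
class AccessProfile:
    """Observed access pattern for one dataset (illustrative fields)."""
    latency_budget_ms: float  # how quickly a read must return
    requests_per_sec: float   # sustained request rate


def classify_temperature(profile: AccessProfile) -> str:
    """Return 'hot', 'warm', or 'cold' using assumed cut-offs.

    The slide lists only ms / ms-sec / min-hrs latency and
    very high / high / low request rate; the numbers below are
    placeholders to tune per workload.
    """
    # Hot: millisecond answers at a very high request rate.
    if profile.latency_budget_ms <= 10 and profile.requests_per_sec >= 1000:
        return "hot"
    # Warm: answers are still needed in under a minute.
    if profile.latency_budget_ms <= 60_000:
        return "warm"
    # Cold: minutes or hours are acceptable.
    return "cold"
```

For example, a session store read thousands of times per second within a few milliseconds would classify as hot, while an archive queried once a day with an hour-long budget would classify as cold.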
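25. [Figure: AWS data stores placed along five axes]
Axes (end labels as shown): Request rate, High to Low; Cost/GB, High to Low; Latency, Low to High; Data volume, Low to High; Structure, Low to High
Services shown: Amazon RDS, Amazon Glacier, Amazon CloudSearch, Amazon DynamoDB, Amazon ElastiCache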
26. Data store comparison (columns run from hot data on the left, through warm data, to cold data on the right):
Characteristic | Amazon ElastiCache | Amazon DynamoDB | Amazon RDS | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier
Average latency | ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~size) | hrs
Data volume | GB | GB–TB (no limit) | GB–TB (3 TB max) | GB–TB | GB–PB (~nodes) | GB–PB (no limit) | GB–PB (no limit)
Item size | B–KB | KB (64 KB max) | KB (~row size) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max)
Request rate | Very High | Very High | High | High | Low–Very High | Low–Very High (no limit) | Very Low (no limit)
Storage cost ($/GB/month) | $$ | ¢¢ | ¢¢ | $ | ¢ | ¢ | ¢
Durability | Low–Moderate | Very High | High | High | High | Very High | Very High
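One way to use this comparison is to screen out stores whose item-size limit a record would exceed. The sketch below encodes only the explicit maximums printed on the slide (they may be out of date); `SLIDE_ITEM_LIMITS` and `fits_item_limit` are hypothetical names introduced for illustration.

```python
KB, MB, TB = 1024, 1024**2, 1024**4

# Maximum item sizes exactly as listed on the slide. Services for which
# the slide gives no explicit maximum (ElastiCache, RDS, EMR/HDFS) are
# left out rather than guessed.
SLIDE_ITEM_LIMITS = {
    "Amazon DynamoDB": 64 * KB,
    "Amazon CloudSearch": 1 * MB,
    "Amazon S3": 5 * TB,
    "Amazon Glacier": 40 * TB,
}


def fits_item_limit(item_bytes: int) -> list:
    """Return the listed services whose stated maximum item size is not exceeded."""
    return [
        name
        for name, limit in SLIDE_ITEM_LIMITS.items()
        if item_bytes <= limit
    ]
```

Item size is only one criterion; latency, request rate, cost, and durability from the same comparison would narrow the candidates further.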
33. Batch Processing
•Take a large amount of cold data and ask questions
•Takes minutes or hours to get answers back
Example: Generating hourly, daily, or weekly reports
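As a rough illustration of the hourly-report example, the sketch below scans a complete set of stored events and counts them per hour. The event shape (ISO timestamp plus an ID) and the function name `hourly_report` are assumptions; a real batch job would typically run on a cluster framework rather than in a single process.

```python
from collections import Counter
from datetime import datetime


def hourly_report(events):
    """Count events per hour.

    `events` is an iterable of (ISO-8601 timestamp, item_id) pairs,
    read in full before any answer is produced (batch style).
    """
    counts = Counter()
    for timestamp, _item_id in events:
        # Truncate each timestamp to the start of its hour.
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        counts[hour.isoformat()] += 1
    return dict(sorted(counts.items()))


# Made-up sample data for illustration only.
sample = [
    ("2014-11-12T10:15:00", "a"),
    ("2014-11-12T10:59:59", "b"),
    ("2014-11-12T11:00:00", "a"),
]
report = hourly_report(sample)
```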
34. Use Case: Video Recommendations
[Figure: architecture diagram using Amazon S3, Amazon Glacier, Amazon DynamoDB, and Amazon EMR]
36. Stream Processing (AKA Real Time)
•Take a small amount of hot data and ask questions
•Takes a short amount of time to get your answer back
Example: 1-minute metrics
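The 1-minute metrics example can be sketched as a tumbling-window counter that emits a result as soon as each minute closes, without waiting for the whole dataset. This is a minimal single-process sketch; the function name `one_minute_counts` and the assumption that timestamps arrive in order are illustrative, not taken from the session.

```python
def one_minute_counts(stream):
    """Yield (window_start_epoch_sec, count) per 1-minute tumbling window.

    `stream` yields epoch-second timestamps in non-decreasing order.
    Windows that receive no events are skipped rather than reported as 0.
    """
    current_window = None
    count = 0
    for ts in stream:
        window = int(ts) // 60 * 60  # start of the minute containing ts
        if current_window is None:
            current_window = window
        if window != current_window:
            # The previous minute is complete: emit it and start a new one.
            yield current_window, count
            current_window, count = window, 0
        count += 1
    if current_window is not None:
        # Flush the final, possibly partial, window.
        yield current_window, count
```

Unlike the batch case, each result here depends only on the small amount of data in the current window, which is what keeps the answer latency short.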
39. Query engine comparison:
Characteristic | Redshift | Impala | Presto | Spark | Hive
Query latency | Low | Low | Low | Low–Medium | Medium–High
Durability | High | High | High | High | High
Data volume | 1.6 PB max | ~Nodes | ~Nodes | ~Nodes | ~Nodes
Managed | Yes | EMR bootstrap | EMR bootstrap | EMR bootstrap | Yes (EMR)
Storage | Native | HDFS | HDFS/S3 | HDFS/S3 | HDFS/S3
# of BI tools | High | Medium | High | Low | High
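The comparison above lends itself to a simple lookup: filter engines by where the data lives, then by latency. The sketch below copies the latency and storage rows from the slide; the dictionary layout and the helper `engines_for` are hypothetical and only illustrate the selection step.

```python
# Latency and storage values as listed on the slide.
ENGINES = {
    "Redshift": {"latency": "Low", "storage": {"Native"}},
    "Impala": {"latency": "Low", "storage": {"HDFS"}},
    "Presto": {"latency": "Low", "storage": {"HDFS", "S3"}},
    "Spark": {"latency": "Low-Medium", "storage": {"HDFS", "S3"}},
    "Hive": {"latency": "Medium-High", "storage": {"HDFS", "S3"}},
}


def engines_for(storage: str, low_latency_only: bool = False) -> list:
    """Return engines that read from `storage`, optionally only those rated 'Low' latency."""
    result = []
    for name, traits in ENGINES.items():
        if storage not in traits["storage"]:
            continue
        if low_latency_only and traits["latency"] != "Low":
            continue
        result.append(name)
    return result
```

For data already in Amazon S3, this returns Presto, Spark, and Hive, and only Presto when low latency is required; the remaining rows (managed status, number of BI tools) would serve as tie-breakers.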
53. •Big data processing stages: ingest, store, process, and visualize
•Use the right tool for the job
–Ingest: Transactional data, file data, stream data
–Storage: Data structure, query patterns, hot vs. cold, etc.
–Processing: Query latency
•Big data reference architecture and design patterns
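The four stages can be pictured as functions chained on a data bus, each consuming the previous stage's output. The sketch below is a toy, in-memory stand-in: the stage bodies and the helper `run_pipeline` are assumptions for illustration and do not call any AWS service.

```python
def run_pipeline(records, stages):
    """Pass `records` through each stage in order and return the final output as a list."""
    data = records
    for stage in stages:
        data = stage(data)
    return list(data)


def ingest(lines):
    # Accept raw input, dropping blank lines and surrounding whitespace.
    return (line.strip() for line in lines if line.strip())


def store(records):
    # Stand-in for writing to a durable store: materialize the records.
    return list(records)


def process(records):
    # Example query: distinct values in sorted order.
    return sorted(set(records))


def visualize(records):
    # Render a numbered list as the "presentation" layer.
    return [f"{i + 1}. {r}" for i, r in enumerate(records)]


output = run_pipeline([" b ", "a", "", "b"], [ingest, store, process, visualize])
```

Keeping the stages decoupled like this is what allows the technology behind each one to be chosen independently, which is the point of using the right tool for each job.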
54. Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.
http://bit.ly/awsevals