Antoine Genereux takes us through a detailed overview of the database solutions available on the AWS Cloud, addressing the needs and requirements of customers at all levels. He also discusses Business Intelligence and Analytics solutions.
3. What is Big Data?
When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze, and share them
4. Big Data Questions and Challenges
How do I absorb multiple sources of data?
How do I store my data?
How do I control access to my data?
How do I keep track of the changes to my data?
How do I know I’m picking the right analysis tool?
How do I know I’m asking the right questions with my data?
How can I make reporting easier and cheaper?
Is there an easier way to do ETL?
How do I move away from my RDBMS?
How do I translate analysis to business answers?
How do I future-proof my architecture?
How do I get started?
How do I keep my costs down?
How do I give access to my different teams?
5. The Dark Data Problem
[Chart: data volume by year, 1990–2020, comparing generated data with data available for analysis]
Most generated data is unavailable for analysis
Sources: Gartner, “User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011”; IDC, “Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares”
6. “Data is the New Oil” - Brian Krzanich, CEO Intel
9. Simplify Big Data Processing
[Pipeline: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & insights]
Evaluate each stage on time to answer (latency), throughput, and cost
10. Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, materialized views (schema-on-read)
Be cost-conscious
• Big data ≠ big cost
12. Types of Data (Ingest/Collect)
Sources: web apps, mobile apps, data center apps, devices, sensors
Ingestion paths: application logging, bulk data transfer, messaging
Data produced: transactions, files, events
Categories:
• RECORDS – in-memory data structures, database records
• DOCUMENTS – search documents, log files
• FILES
• MESSAGES – messages
• STREAMS – data streams
13. Data Temperature (Ingest/Collect)
|              | Hot       | Warm    | Cold      |
| Volume       | MB–GB     | GB–TB   | PB–EB     |
| Item size    | B–KB      | KB–MB   | KB–TB     |
| Latency      | µs–ms     | ms–sec  | min–hrs   |
| Durability   | Low–high  | High    | Very high |
| Request rate | Very high | High    | Low       |
| Cost/GB      | $$–$      | $–¢¢    | ¢         |
14. Database SQL & NoSQL databases
Search Search engines
File/Object
store
File systems
Queue Message queues
Stream
storage
Pub/sub message queues
In-memory Caches, data structure servers
Types of Data Stores
RECORDS
DOCUMENTS
FILES
MESSAGES
STREAMS
Store
17. Why is S3 Good for Analytics?
Amazon S3 Core Features
• Unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Designed for 99.99% availability – can tolerate zone failure
• Designed for 99.999999999% durability
• 3x Data replication included with the service
Store
18. Why is S3 Good for Analytics?
Decoupling Storage and Compute
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances (Auto Scaling, Spot Blocks, Spot Fleets)
• Multiple & heterogeneous analysis clusters can use the same data
Store
19. Why is S3 Good for Analytics?
Amazon S3 Additional Features
• Native support for versioning, object tagging
• Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies
• Tiering optimization through S3 Analytics
• Secure – SSL, client/server-side encryption at rest, and granular access policies
• Low cost ($0.025/GB in Canada)
Store
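The tiering and encryption features above map directly onto bucket configuration calls. A minimal boto3 sketch, assuming a hypothetical bucket, prefix, and transition schedule:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix: tier raw analytics data down to cheaper storage over time.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-logs",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm after 30 days
                    {"Days": 365, "StorageClass": "GLACIER"},     # cold after a year
                ],
            }
        ]
    },
)

# Server-side encryption on upload (SSE-S3); SSE-KMS or client-side encryption also work.
s3.put_object(
    Bucket="example-analytics-bucket",
    Key="raw/2017/05/01/events.json",
    Body=b'{"event": "page_view"}',
    ServerSideEncryption="AES256",
)
```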
23. Use the Right Tool for the Job
• Search – Amazon Elasticsearch Service
• In-memory – Amazon ElastiCache (Redis, Memcached)
• SQL – Amazon Aurora; Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server, MariaDB)
• NoSQL – Amazon DynamoDB; Cassandra, HBase, MongoDB
24. Amazon ElastiCache
• Microsecond real-time performance
• Fully managed
• Redis automatic failover = NoOps
• Enhanced Redis engine
• No cross-AZ data transfer costs
• Easy to deploy, use, and monitor
• Open-source compatible
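Because ElastiCache is open-source compatible, any standard Redis client can talk to the cluster endpoint. A minimal sketch using the redis-py library, assuming a hypothetical endpoint and a hypothetical database loader:

```python
import redis

# Hypothetical ElastiCache Redis primary endpoint; 6379 is the default Redis port.
cache = redis.StrictRedis(
    host="my-cache.abc123.0001.cac1.cache.amazonaws.com",
    port=6379,
    decode_responses=True,
)

# Classic cache-aside pattern: check the cache first, fall back to the database on a miss.
def get_user_profile(user_id, load_from_db):
    key = f"user:{user_id}"
    profile = cache.get(key)
    if profile is None:
        profile = load_from_db(user_id)   # expensive query against RDS/Aurora, etc.
        cache.setex(key, 300, profile)    # keep the result warm for 5 minutes
    return profile
```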
26. Amazon RDS
Automated backups (with point-in-time recovery)
Cross-region snapshot copies
Automated patch management
Automated Multi-AZ replication
Scale up / Scale down instance types
Scalable storage on demand
“License included” and BYOL models
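The automated backups and snapshots listed above can also be driven through the API. A hedged boto3 sketch, where the instance identifiers and restore timestamp are hypothetical placeholders:

```python
from datetime import datetime, timezone
import boto3

rds = boto3.client("rds")

# Restore a new instance from an existing instance's automated backups,
# as of a specific point in time (identifiers and timestamp are placeholders).
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="orders-prod",
    TargetDBInstanceIdentifier="orders-prod-restored",
    RestoreTime=datetime(2017, 5, 1, 12, 0, 0, tzinfo=timezone.utc),
)

# Manual snapshots complement the automated ones (e.g., before a risky schema migration).
rds.create_db_snapshot(
    DBInstanceIdentifier="orders-prod",
    DBSnapshotIdentifier="orders-prod-pre-migration",
)
```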
27. Amazon Aurora: MySQL- and PostgreSQL-compatible
[Diagram: SQL, transaction, and caching layers over storage replicated across AZ 1, AZ 2, and AZ 3, with Amazon S3 for backups]
• 5x faster than MySQL on the same hardware
• SysBench: 100K writes/sec and 500K reads/sec
• Designed for 99.99% availability
• 6-way replicated storage across 3 AZs
• Scale to 64 TB and 15 read replicas
28. How Ticketmaster uses RDS Aurora
Quick statistics
• Top 5 ecommerce site
• 26,000 Live Nation events per year
• 530M fans in more than 37 countries
• 465M ticket transactions annually
• 1B+ unique visits to the web front end
• 400K concert tickets sold in a morning
Account Manager migration
• MySQL 5.6 to Aurora
• 12 MySQL servers to 5 Aurora nodes
• Deployment time down from 1–2 h to 20 minutes
• Test environment build went from 1 week to 30 minutes
Terraformer
• Infrastructure-as-code tool for databases
With Aurora:
• Scale in/out horizontally within 10 minutes, without downtime
• Scale up/down vertically with 30 seconds of failover downtime
https://aws.amazon.com/solutions/case-studies/ticketmaster/
29. • Start your first migration in 10 minutes or less
• Keep your apps running during the migration
• Replicate within, to, or from Amazon EC2 or RDS
• Move data to the same or a different database engine
AWS Database Migration Service
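Starting a migration from code follows the same steps as the console flow above. A hedged boto3 sketch of creating and starting a DMS replication task, where all ARNs and the table-mapping rule are hypothetical placeholders:

```python
import json
import boto3

dms = boto3.client("dms")

# Source/target endpoints and the replication instance are assumed to exist already.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-aurora-demo",
    SourceEndpointArn="arn:aws:dms:ca-central-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:ca-central-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:ca-central-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial copy, then ongoing change replication
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```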
32. Analytics Types & Frameworks (Process/Analyze)
| Type | Time to answer | Examples | Services/Frameworks |
| Batch | min – hrs | Reporting, BI; ML training | Amazon EMR, Amazon Redshift |
| Interactive | sec | Data exploration; data forensics | Amazon EMR, Amazon Athena, Redshift Spectrum |
| Message | ms – sec | Message processing | Amazon SQS |
| Stream | ms – sec | Alerts & notifications; real-time dashboards | Amazon EMR, Amazon Kinesis, AWS Lambda, Apache Storm |
| Machine learning | ms – min | Demand forecast; recommendations | Amazon ML, Amazon EMR |
33. Batch Analytics – Amazon Redshift (Process/Analyze)
• Relational, columnar, MPP data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
34. Amazon Redshift architecture (Process/Analyze)
Leader node
• Simple SQL endpoint (JDBC/ODBC)
• Stores metadata
• Optimizes query plan
• Coordinates query execution
Compute nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, resizes
• 10 GigE (HPC) interconnect; ingestion/backup and restore
Start at just $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB
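Loading through the leader node is just a JDBC/ODBC connection plus a COPY statement, which the compute nodes execute in parallel against S3. A minimal sketch using psycopg2, where the cluster endpoint, table, bucket, and IAM role are hypothetical placeholders:

```python
import psycopg2

# Connect to the leader node's SQL endpoint (5439 is the default Redshift port).
conn = psycopg2.connect(
    host="example-cluster.abc123.ca-central-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="...",
)

with conn, conn.cursor() as cur:
    # COPY fans out across the compute nodes, loading many S3 objects in parallel.
    cur.execute("""
        COPY page_views
        FROM 's3://example-analytics-bucket/raw/page_views/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto'
        GZIP;
    """)
    cur.execute("SELECT COUNT(*) FROM page_views;")
    print(cur.fetchone()[0])
```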
35. Why migrate to Amazon Redshift? (Process/Analyze)
From a transactional database:
• 100x faster
• Scales from GBs to PBs
• Analyze data without storage constraints
From an MPP database:
• 10x cheaper
• Easy to provision and operate
• Higher productivity
From Hadoop:
• 10x faster
• No programming
• Standard interfaces and integration to leverage BI tools, machine learning, streaming
36. Migration from Oracle @ Boingo Wireless (Process/Analyze)
• 2,000+ commercial Wi-Fi locations
• 1 million+ hotspots
• 90M+ ad engagements
• 100+ countries
Legacy DW: Oracle 11g-based data warehouse
Before migration:
• Rapid data growth slowed analytics
• Low IOPS, limited memory, vertical scaling
• Admin overhead
• Expensive (license, h/w, support)
After migration:
• 180x performance improvement
• 7x cost savings
https://aws.amazon.com/solutions/case-studies/boingo-wireless/
38. Interactive Analytics (Process/Analyze)
Amazon Athena
• Managed query service for Amazon S3 data
• Zero spin-up time
• Presto and Hive based
• No data load or ETL required
• CSV, web log, TSV, JSON, Parquet, ORC, Avro support
• JDBC driver
Redshift Spectrum
• Exabyte-scale queries on S3 directly from Redshift
• High concurrency, elastic, and highly available
• No data load or ETL required
• Full Redshift SQL support
• Redshift optimizations
• JDBC/ODBC support
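Because Athena queries S3 data in place with no spin-up, a query is one API call plus polling for the result. A minimal boto3 sketch, where the database, table, and result bucket are hypothetical placeholders:

```python
import time
import boto3

athena = boto3.client("athena")

# Kick off a query against data already sitting in S3; results also land in S3.
qid = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM weblogs GROUP BY status",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Simple polling loop; production code would add backoff and a timeout.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```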
39. Real-Time Analytics with MLB Statcast (Process/Analyze)
• Ball position sampled 2,000 times/second
• Player position sampled 30 times/second
• 12-second time to answer
• 7 TB of data generated per game
https://aws.amazon.com/solutions/case-studies/major-league-baseball-mlbam/
41. What About ETL? (Store → Process/Analyze)
Data Integration Partners
• Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes
https://aws.amazon.com/big-data/partner-solutions/
AWS Glue (Preview)
• A fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores
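AWS Glue was still in preview at the time of this talk; once available, cataloging an S3 data source is a couple of API calls. A hedged boto3 sketch, where the crawler name, IAM role, database, S3 path, and job name are hypothetical placeholders:

```python
import boto3

glue = boto3.client("glue")

# A crawler walks the S3 prefix, infers schemas, and writes tables into the Glue Data Catalog,
# which Athena, EMR, and Redshift Spectrum can then query.
glue.create_crawler(
    Name="weblogs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://example-analytics-bucket/raw/weblogs/"}]},
)
glue.start_crawler(Name="weblogs-crawler")

# Once an ETL job has been authored in Glue, it is launched the same way.
glue.start_job_run(JobName="weblogs-to-parquet")
```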
42. Data Consumption & Visualization (Consume/Visualize)
• Business users – analysis and visualization with Amazon QuickSight
• Data scientists, developers – notebooks, IDEs, applications & APIs
43. Amazon QuickSight (Consume/Visualize)
[Diagram: business users reach the QuickSight UI from mobile devices and web browsers; the QuickSight API exposes connectors, data prep, metadata, suggestions, and SPICE; data sources include Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS, Amazon Athena, files, apps, and on-premises data over direct connect and JDBC/ODBC; partner BI products can use the same sources]
45. Real-Time Analytics and ML – DataXu
Who
• Spun out of MIT Labs
• A petabyte-scale digital marketing platform
• One of the fastest-growing companies in the Inc. 5000
What
• Helps the world’s most valuable brands understand and engage with their consumers
• Maximizes ROI
Quick statistics
• 2M+ bid requests per second
• Billions of impressions, petabytes of data
• 180+ TB of logs per day
• 2 PB of data analyzed daily
• 3,000+ servers powering the platform
• 13 regions, 24x7
https://aws.amazon.com/solutions/case-studies/dataxu/
47. DataXu Data Flow (Batch Layer)
[Diagram: the CDN, real-time bidding, and retargeting platform feed Amazon Kinesis Streams; all data lands in S3; ETL and attribution jobs feed an ecosystem of tools and services, including Amazon Athena, Amazon Machine Learning, and third-party advanced analytics and reporting tools]
48. [Architecture diagram – Redfin: real-time data about users, properties, agents, and hot homes is ingested with Amazon Kinesis into an Amazon S3 data lake; Amazon EMR, Amazon Redshift, and Amazon DynamoDB process, analyze, and serve it; answers & insights include user profiles, recommendations, hot homes, similar homes, agent follow-up, agent scorecards, marketing, A/B testing, and BI/reporting]
https://aws.amazon.com/solutions/case-studies/redfin/
49. [Architecture diagram – FINRA: data providers feed data ingestion services (Auto Scaling EC2); normalization and optimization ETL clusters (EMR) populate a source-of-truth store (S3) and a query-optimized store (S3); batch analytic clusters, ad hoc query clusters, and query clusters (EMR), plus data marts (Amazon Redshift), back analytics apps (Auto Scaling EC2) used by end users; shared data services include a shared metastore (RDS), reference data (RDS), data catalog & lineage services, and cluster management & workflow services]
5+ PB data lake; up to 75 billion market events analyzed per day
https://aws.amazon.com/solutions/case-studies/finra/
52. Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, materialized views (schema-on-read)
Be cost-conscious
• Big data ≠ big cost
55. Data Characteristics: Hot, Warm, Cold
|              | Hot       | Warm    | Cold      |
| Volume       | MB–GB     | GB–TB   | PB–EB     |
| Item size    | B–KB      | KB–MB   | KB–TB     |
| Latency      | ms        | ms, sec | min, hrs  |
| Durability   | Low–high  | High    | Very high |
| Request rate | Very high | High    | Low       |
| Cost/GB      | $$–$      | $–¢¢    | ¢         |
56. Which Stream/Message Storage Should I Use?
| | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS (Standard) | Amazon SQS (FIFO) |
| AWS managed | Yes | Yes | Yes | No | Yes | Yes |
| Guaranteed ordering | Yes | Yes | No | Yes | No | Yes |
| Delivery (deduping) | Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once | Exactly-once |
| Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days | 14 days |
| Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ |
| Scale / throughput | No limit / ~table IOPS | No limit / ~shards | No limit / automatic | No limit / ~nodes | No limits / automatic | 300 TPS / queue |
| Parallel consumption | Yes | Yes | No | Yes | No | No |
| Stream MapReduce | Yes | Yes | N/A | Yes | N/A | N/A |
| Row/object size | 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB |
| Cost | Higher (table cost) | Low | Low | Low (+admin) | Low–medium | Low–medium |
(Columns run from hot data on the left to warm data on the right.)
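For the Kinesis Streams column above, producing and consuming records is a thin API. A minimal boto3 sketch, where the stream name and payload are hypothetical placeholders:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Producer: the partition key decides which shard receives the record,
# and the shard count sets the stream's throughput ceiling.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user_id": "42", "event": "page_view"}).encode("utf-8"),
    PartitionKey="42",
)

# Simple consumer for one shard (production consumers typically use the KCL instead).
shard_id = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream", ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(json.loads(record["Data"]))
```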
57. Which Data Store Should I Use?
| | Amazon ElastiCache | Amazon DynamoDB | Amazon RDS/Aurora | Amazon ES | Amazon S3 | Amazon Glacier |
| Average latency | ms | ms | ms, sec | ms, sec | ms, sec, min (~size) | hrs |
| Typical data stored | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | MB–PB (no limit) | GB–PB (no limit) |
| Typical item size | B–KB | KB (400 KB max) | KB (64 KB max) | B–KB (2 GB max) | KB–TB (5 TB max) | GB (40 TB max) |
| Request rate | High – very high | Very high (no limit) | High | High | Low – high (no limit) | Very low |
| Storage cost GB/month | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢/10 |
| Durability | Low – moderate | Very high | Very high | High | Very high | Very high |
| Availability | High (2 AZ) | Very high (3 AZ) | Very high (3 AZ) | High (2 AZ) | Very high (3 AZ) | Very high (3 AZ) |
(Columns run from hot data on the left to cold data on the right.)
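For the warm, key-value side of the comparison above, DynamoDB access boils down to a few calls against a table's primary key. A minimal boto3 sketch, where the table and attribute names are hypothetical placeholders:

```python
import boto3

# The resource API maps items to plain Python dicts.
table = boto3.resource("dynamodb").Table("user_profiles")

# Write an item (at most 400 KB per item, as noted in the comparison above).
table.put_item(Item={
    "user_id": "42",                 # partition key
    "name": "Jane",
    "favorite_listings": ["h-1001", "h-2002"],
})

# Single-digit-millisecond point read by primary key.
response = table.get_item(Key={"user_id": "42"})
print(response.get("Item"))
```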
58. Which Stream/Message Processing Technology Should I Use?
| | Amazon EMR (Spark Streaming) | Apache Storm | KCL Application | Amazon Kinesis Analytics | AWS Lambda | Amazon SQS Application |
| AWS managed | Yes (Amazon EMR) | No (do it yourself) | No (EC2 + Auto Scaling) | Yes | Yes | No (EC2 + Auto Scaling) |
| Serverless | No | No | No | Yes | Yes | No |
| Scale / throughput | No limits / ~nodes | No limits / ~nodes | No limits / ~nodes | Up to 8 KPU / automatic | No limits / automatic | No limits / ~nodes |
| Availability | Single AZ | Configurable | Multi-AZ | Multi-AZ | Multi-AZ | Multi-AZ |
| Programming languages | Java, Python, Scala | Almost any language via Thrift | Java, others via MultiLangDaemon | ANSI SQL with extensions | Node.js, Java, Python | AWS SDK languages (Java, .NET, Python, …) |
| Uses | Multistage processing | Multistage processing | Single-stage processing | Multistage processing | Simple event-based triggers | Simple event-based triggers |
| Reliability | KCL and Spark checkpoints | Framework managed | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda | Managed by SQS visibility timeout |
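The Lambda column above (“simple event-based triggers”) comes down to a handler that is invoked with batches of stream records. A minimal sketch of a Kinesis-triggered function, where the processing step is a hypothetical placeholder:

```python
import base64
import json

def handler(event, context):
    """Invoked by AWS Lambda with a batch of Kinesis records from the mapped stream."""
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Hypothetical processing step: e.g., filter, enrich, or fan out an alert.
        if payload.get("event") == "error":
            print(f"alert: {payload}")
    return {"records_processed": len(event["Records"])}
```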
59. Which Analysis Tool Should I Use?
| | Amazon Redshift | Amazon Athena | Amazon EMR (Presto / Spark / Hive) |
| Use case | Optimized for data warehousing | Ad hoc interactive queries | Presto: interactive query; Spark: general purpose (iterative ML, RT, …); Hive: batch |
| Scale / throughput | ~Nodes | Automatic / no limits | ~Nodes |
| AWS managed service | Yes | Yes, serverless | Yes |
| Storage | Local storage | Amazon S3 | Amazon S3, HDFS |
| Optimization | Columnar storage, data compression, and zone maps | CSV, TSV, JSON, Parquet, ORC, Apache web log | Framework dependent |
| Metadata | Amazon Redshift managed | Athena Catalog Manager | Hive metastore |
| BI tools support | Yes (JDBC/ODBC) | Yes (JDBC) | Yes (JDBC/ODBC & custom) |
| Access controls | Users, groups, and access controls | AWS IAM | Integration with LDAP |
| UDF support | Yes (scalar) | No | Yes |