Antoine Genereux takes us through a detailed overview of the database solutions available on the AWS Cloud, addressing the needs and requirements of customers at all levels. He also discusses Business Intelligence and Analytics solutions.
3. What is Big Data?
When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze, and share them
4. Big Data Questions and Challenges
How do I absorb multiple sources of data?
How do I store my data?
How do I control access to my data?
How do I keep track of the changes to my data?
How do I know I’m picking the right analysis tool?
How do I know I’m asking the right questions with my data?
How can I make reporting easier and cheaper?
Is there an easier way to do ETL?
How do I move away from my RDBMS?
How do I translate analysis to business answers?
How do I future-proof my architecture?
How do I get started?
How do I keep my costs down?
How do I give access to my different teams?
5. The Dark Data Problem
[Chart: data volume by year, 1990–2020, comparing generated data with data available for analysis]
Most generated data is unavailable for analysis
Sources: Gartner, “User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011”; IDC, “Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares”
6. “Data is the New Oil” - Brian Krzanich, CEO Intel
9. Simplify Big Data Processing
[Pipeline: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & insights]
Evaluate each stage on time to answer (latency), throughput, and cost
10. Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, materialized views (schema-on-read)
Be cost-conscious
• Big data ≠ big cost
12. Types of Data (Ingest/Collect)
Sources: web apps, mobile apps, data center apps, devices, sensors
Ingestion paths: application logging, bulk data transfer, messaging
Data produced: transactions, files, events
Categories:
• RECORDS – in-memory data structures, database records
• DOCUMENTS – search documents, log files
• FILES
• MESSAGES – messages
• STREAMS – data streams
13. Data Temperature (Ingest/Collect)
|              | Hot       | Warm    | Cold      |
| Volume       | MB–GB     | GB–TB   | PB–EB     |
| Item size    | B–KB      | KB–MB   | KB–TB     |
| Latency      | µs–ms     | ms–sec  | min–hrs   |
| Durability   | Low–high  | High    | Very high |
| Request rate | Very high | High    | Low       |
| Cost/GB      | $$–$      | $–¢¢    | ¢         |
14. Database SQL & NoSQL databases
Search Search engines
File/Object
store
File systems
Queue Message queues
Stream
storage
Pub/sub message queues
In-memory Caches, data structure servers
Types of Data Stores
RECORDS
DOCUMENTS
FILES
MESSAGES
STREAMS
Store
17. Why is S3 Good for Analytics?
Amazon S3 Core Features
• Unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Designed for 99.99% availability – can tolerate zone failure
• Designed for 99.999999999% durability
• 3x Data replication included with the service
Store
18. Why is S3 Good for Analytics?
Decoupling Storage and Compute
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances (Auto Scaling, Spot Blocks, Spot Fleets)
• Multiple & heterogeneous analysis clusters can use the same data
Store
19. Why is S3 Good for Analytics?
Amazon S3 Additional Features
• Native support for versioning, object tagging
• Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies
• Tiering optimization through S3 Analytics
• Secure – SSL, client/server-side encryption at rest, and granular access policies
• Low cost ($0.025/GB in Canada)
Store
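The tiering and encryption features above map directly onto bucket configuration calls. A minimal boto3 sketch, assuming a hypothetical bucket, prefix, and transition schedule:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/prefix: tier raw analytics data down to cheaper storage over time.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-raw-logs",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm after 30 days
                    {"Days": 365, "StorageClass": "GLACIER"},     # cold after a year
                ],
            }
        ]
    },
)

# Server-side encryption on upload (SSE-S3); SSE-KMS or client-side encryption also work.
s3.put_object(
    Bucket="example-analytics-bucket",
    Key="raw/2017/05/01/events.json",
    Body=b'{"event": "page_view"}',
    ServerSideEncryption="AES256",
)
```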
23. Use the Right Tool for the Job
• Search – Amazon Elasticsearch Service
• In-memory – Amazon ElastiCache (Redis, Memcached)
• SQL – Amazon Aurora; Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server, MariaDB)
• NoSQL – Amazon DynamoDB; Cassandra, HBase, MongoDB
24. Amazon ElastiCache
• Microsecond real-time performance
• Fully managed
• Redis automatic failover = NoOps
• Enhanced Redis engine
• No cross-AZ data transfer costs
• Easy to deploy, use, and monitor
• Open-source compatible
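Because ElastiCache is open-source compatible, any standard Redis client can talk to the cluster endpoint. A minimal sketch using the redis-py library, assuming a hypothetical endpoint and a hypothetical database loader:

```python
import redis

# Hypothetical ElastiCache Redis primary endpoint; 6379 is the default Redis port.
cache = redis.StrictRedis(
    host="my-cache.abc123.0001.cac1.cache.amazonaws.com",
    port=6379,
    decode_responses=True,
)

# Classic cache-aside pattern: check the cache first, fall back to the database on a miss.
def get_user_profile(user_id, load_from_db):
    key = f"user:{user_id}"
    profile = cache.get(key)
    if profile is None:
        profile = load_from_db(user_id)   # expensive query against RDS/Aurora, etc.
        cache.setex(key, 300, profile)    # keep the result warm for 5 minutes
    return profile
```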
26. Amazon RDS
Automated backups (with point-in-time recovery)
Cross-region snapshot copies
Automated patch management
Automated Multi-AZ replication
Scale up / Scale down instance types
Scalable storage on demand
“License included” and BYOL models
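The automated backups and snapshots listed above can also be driven through the API. A hedged boto3 sketch, where the instance identifiers and restore timestamp are hypothetical placeholders:

```python
from datetime import datetime, timezone
import boto3

rds = boto3.client("rds")

# Restore a new instance from an existing instance's automated backups,
# as of a specific point in time (identifiers and timestamp are placeholders).
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier="orders-prod",
    TargetDBInstanceIdentifier="orders-prod-restored",
    RestoreTime=datetime(2017, 5, 1, 12, 0, 0, tzinfo=timezone.utc),
)

# Manual snapshots complement the automated ones (e.g., before a risky schema migration).
rds.create_db_snapshot(
    DBInstanceIdentifier="orders-prod",
    DBSnapshotIdentifier="orders-prod-pre-migration",
)
```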
27. Amazon Aurora: MySQL- and PostgreSQL-compatible
[Diagram: SQL, transaction, and caching layers over storage replicated across AZ 1, AZ 2, and AZ 3, with Amazon S3 for backups]
• 5x faster than MySQL on the same hardware
• SysBench: 100K writes/sec and 500K reads/sec
• Designed for 99.99% availability
• 6-way replicated storage across 3 AZs
• Scale to 64 TB and 15 read replicas
28. How Ticketmaster uses RDS Aurora
Quick statistics
• Top 5 ecommerce site
• 26,000 Live Nation events per year
• 530M fans in more than 37 countries
• 465M ticket transactions annually
• 1B+ unique visits to the web front end
• 400K concert tickets sold in a morning
Account Manager migration
• MySQL 5.6 to Aurora
• 12 MySQL servers to 5 Aurora nodes
• Deployment time down from 1–2 h to 20 minutes
• Test environment build went from 1 week to 30 minutes
Terraformer
• Infrastructure-as-code tool for databases
With Aurora:
• Scale in/out horizontally within 10 minutes, without downtime
• Scale up/down vertically with 30 seconds of failover downtime
https://aws.amazon.com/solutions/case-studies/ticketmaster/
29. • Start your first migration in 10 minutes or less
• Keep your apps running during the migration
• Replicate within, to, or from Amazon EC2 or RDS
• Move data to the same or a different database engine
AWS Database Migration Service
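Starting a migration from code follows the same steps as the console flow above. A hedged boto3 sketch of creating and starting a DMS replication task, where all ARNs and the table-mapping rule are hypothetical placeholders:

```python
import json
import boto3

dms = boto3.client("dms")

# Source/target endpoints and the replication instance are assumed to exist already.
task = dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-aurora-demo",
    SourceEndpointArn="arn:aws:dms:ca-central-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:ca-central-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:ca-central-1:123456789012:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # initial copy, then ongoing change replication
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)

dms.start_replication_task(
    ReplicationTaskArn=task["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```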
32. Analytics Types & Frameworks (Process/Analyze)
| Type | Time to answer | Examples | Services/Frameworks |
| Batch | min – hrs | Reporting, BI; ML training | Amazon EMR, Amazon Redshift |
| Interactive | sec | Data exploration; data forensics | Amazon EMR, Amazon Athena, Redshift Spectrum |
| Message | ms – sec | Message processing | Amazon SQS |
| Stream | ms – sec | Alerts & notifications; real-time dashboards | Amazon EMR, Amazon Kinesis, AWS Lambda, Apache Storm |
| Machine learning | ms – min | Demand forecast; recommendations | Amazon ML, Amazon EMR |
33. Batch Analytics – Amazon Redshift (Process/Analyze)
• Relational, columnar, MPP data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
34. Amazon Redshift architecture (Process/Analyze)
Leader node
• Simple SQL endpoint (JDBC/ODBC)
• Stores metadata
• Optimizes query plan
• Coordinates query execution
Compute nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, resizes
• 10 GigE (HPC) interconnect; ingestion/backup and restore
Start at just $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB
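Loading through the leader node is just a JDBC/ODBC connection plus a COPY statement, which the compute nodes execute in parallel against S3. A minimal sketch using psycopg2, where the cluster endpoint, table, bucket, and IAM role are hypothetical placeholders:

```python
import psycopg2

# Connect to the leader node's SQL endpoint (5439 is the default Redshift port).
conn = psycopg2.connect(
    host="example-cluster.abc123.ca-central-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="...",
)

with conn, conn.cursor() as cur:
    # COPY fans out across the compute nodes, loading many S3 objects in parallel.
    cur.execute("""
        COPY page_views
        FROM 's3://example-analytics-bucket/raw/page_views/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto'
        GZIP;
    """)
    cur.execute("SELECT COUNT(*) FROM page_views;")
    print(cur.fetchone()[0])
```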
35. Why migrate to Amazon Redshift? (Process/Analyze)
From a transactional database:
• 100x faster
• Scales from GBs to PBs
• Analyze data without storage constraints
From an MPP database:
• 10x cheaper
• Easy to provision and operate
• Higher productivity
From Hadoop:
• 10x faster
• No programming
• Standard interfaces and integration to leverage BI tools, machine learning, streaming
36. Migration from Oracle @ Boingo Wireless (Process/Analyze)
• 2,000+ commercial Wi-Fi locations
• 1 million+ hotspots
• 90M+ ad engagements
• 100+ countries
Legacy DW: Oracle 11g-based data warehouse
Before migration:
• Rapid data growth slowed analytics
• Low IOPS, limited memory, vertical scaling
• Admin overhead
• Expensive (license, h/w, support)
After migration:
• 180x performance improvement
• 7x cost savings
https://aws.amazon.com/solutions/case-studies/boingo-wireless/
38. Interactive Analytics (Process/Analyze)
Amazon Athena
• Managed query service for Amazon S3 data
• Zero spin-up time
• Presto and Hive based
• No data load or ETL required
• CSV, web log, TSV, JSON, Parquet, ORC, Avro support
• JDBC driver
Redshift Spectrum
• Exabyte-scale queries on S3 directly from Redshift
• High concurrency, elastic, and highly available
• No data load or ETL required
• Full Redshift SQL support
• Redshift optimizations
• JDBC/ODBC support
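Because Athena queries S3 data in place with no spin-up, a query is one API call plus polling for the result. A minimal boto3 sketch, where the database, table, and result bucket are hypothetical placeholders:

```python
import time
import boto3

athena = boto3.client("athena")

# Kick off a query against data already sitting in S3; results also land in S3.
qid = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM weblogs GROUP BY status",
    QueryExecutionContext={"Database": "analytics"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

# Simple polling loop; production code would add backoff and a timeout.
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```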
39. Real-Time Analytics with MLB Statcast (Process/Analyze)
• Ball position sampled 2,000 times/second
• Player position sampled 30 times/second
• 12-second time to answer
• 7 TB of data generated per game
https://aws.amazon.com/solutions/case-studies/major-league-baseball-mlbam/
41. What About ETL? (Store → Process/Analyze)
Data Integration Partners
• Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes
https://aws.amazon.com/big-data/partner-solutions/
AWS Glue (Preview)
• A fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores
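AWS Glue was still in preview at the time of this talk; once available, cataloging an S3 data source is a couple of API calls. A hedged boto3 sketch, where the crawler name, IAM role, database, S3 path, and job name are hypothetical placeholders:

```python
import boto3

glue = boto3.client("glue")

# A crawler walks the S3 prefix, infers schemas, and writes tables into the Glue Data Catalog,
# which Athena, EMR, and Redshift Spectrum can then query.
glue.create_crawler(
    Name="weblogs-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="analytics",
    Targets={"S3Targets": [{"Path": "s3://example-analytics-bucket/raw/weblogs/"}]},
)
glue.start_crawler(Name="weblogs-crawler")

# Once an ETL job has been authored in Glue, it is launched the same way.
glue.start_job_run(JobName="weblogs-to-parquet")
```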
42. Data Consumption & Visualization (Consume/Visualize)
• Business users – analysis and visualization with Amazon QuickSight
• Data scientists, developers – notebooks, IDEs, applications & APIs
43. Amazon QuickSight (Consume/Visualize)
[Diagram: business users reach the QuickSight UI from mobile devices and web browsers; the QuickSight API exposes connectors, data prep, metadata, suggestions, and SPICE; data sources include Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS, Amazon Athena, files, apps, and on-premises data over direct connect and JDBC/ODBC; partner BI products can use the same sources]
45. Real-Time Analytics and ML – DataXu
Who
• Spun out of MIT Labs
• A petabyte-scale digital marketing platform
• One of the fastest-growing companies in the Inc. 5000
What
• Helps the world’s most valuable brands understand and engage with their consumers
• Maximizes ROI
Quick statistics
• 2M+ bid requests per second
• Billions of impressions, petabytes of data
• 180+ TB of logs per day
• 2 PB of data analyzed daily
• 3,000+ servers powering the platform
• 13 regions, 24x7
https://aws.amazon.com/solutions/case-studies/dataxu/
47. DataXu Data Flow (Batch Layer)
[Diagram: the CDN, real-time bidding, and retargeting platform feed Amazon Kinesis Streams; all data lands in S3; ETL and attribution jobs feed an ecosystem of tools and services, including Amazon Athena, Amazon Machine Learning, and third-party advanced analytics and reporting tools]
48. [Architecture diagram – Redfin: real-time data about users, properties, agents, and hot homes is ingested with Amazon Kinesis into an Amazon S3 data lake; Amazon EMR, Amazon Redshift, and Amazon DynamoDB process, analyze, and serve it; answers & insights include user profiles, recommendations, hot homes, similar homes, agent follow-up, agent scorecards, marketing, A/B testing, and BI/reporting]
https://aws.amazon.com/solutions/case-studies/redfin/
49. [Architecture diagram – FINRA: data providers feed data ingestion services (Auto Scaling EC2); normalization and optimization ETL clusters (EMR) populate a source-of-truth store (S3) and a query-optimized store (S3); batch analytic clusters, ad hoc query clusters, and query clusters (EMR), plus data marts (Amazon Redshift), back analytics apps (Auto Scaling EC2) used by end users; shared data services include a shared metastore (RDS), reference data (RDS), data catalog & lineage services, and cluster management & workflow services]
5+ PB data lake; up to 75 billion market events analyzed per day
https://aws.amazon.com/solutions/case-studies/finra/
52. Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, materialized views (schema-on-read)
Be cost-conscious
• Big data ≠ big cost
55. Data Characteristics: Hot, Warm, Cold
|              | Hot       | Warm    | Cold      |
| Volume       | MB–GB     | GB–TB   | PB–EB     |
| Item size    | B–KB      | KB–MB   | KB–TB     |
| Latency      | ms        | ms, sec | min, hrs  |
| Durability   | Low–high  | High    | Very high |
| Request rate | Very high | High    | Low       |
| Cost/GB      | $$–$      | $–¢¢    | ¢         |
56. Which Stream/Message Storage Should I Use?
| | Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS (Standard) | Amazon SQS (FIFO) |
| AWS managed | Yes | Yes | Yes | No | Yes | Yes |
| Guaranteed ordering | Yes | Yes | No | Yes | No | Yes |
| Delivery (deduping) | Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once | Exactly-once |
| Data retention period | 24 hours | 7 days | N/A | Configurable | 14 days | 14 days |
| Availability | 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ |
| Scale / throughput | No limit / ~table IOPS | No limit / ~shards | No limit / automatic | No limit / ~nodes | No limits / automatic | 300 TPS / queue |
| Parallel consumption | Yes | Yes | No | Yes | No | No |
| Stream MapReduce | Yes | Yes | N/A | Yes | N/A | N/A |
| Row/object size | 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB |
| Cost | Higher (table cost) | Low | Low | Low (+admin) | Low–medium | Low–medium |
(Columns run from hot data on the left to warm data on the right.)
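For the Kinesis Streams column above, producing and consuming records is a thin API. A minimal boto3 sketch, where the stream name and payload are hypothetical placeholders:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Producer: the partition key decides which shard receives the record,
# and the shard count sets the stream's throughput ceiling.
kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user_id": "42", "event": "page_view"}).encode("utf-8"),
    PartitionKey="42",
)

# Simple consumer for one shard (production consumers typically use the KCL instead).
shard_id = kinesis.describe_stream(StreamName="clickstream")["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream", ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator, Limit=10)["Records"]:
    print(json.loads(record["Data"]))
```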
57. Which Data Store Should I Use?
| | Amazon ElastiCache | Amazon DynamoDB | Amazon RDS/Aurora | Amazon ES | Amazon S3 | Amazon Glacier |
| Average latency | ms | ms | ms, sec | ms, sec | ms, sec, min (~size) | hrs |
| Typical data stored | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | MB–PB (no limit) | GB–PB (no limit) |
| Typical item size | B–KB | KB (400 KB max) | KB (64 KB max) | B–KB (2 GB max) | KB–TB (5 TB max) | GB (40 TB max) |
| Request rate | High – very high | Very high (no limit) | High | High | Low – high (no limit) | Very low |
| Storage cost GB/month | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢/10 |
| Durability | Low – moderate | Very high | Very high | High | Very high | Very high |
| Availability | High (2 AZ) | Very high (3 AZ) | Very high (3 AZ) | High (2 AZ) | Very high (3 AZ) | Very high (3 AZ) |
(Columns run from hot data on the left to cold data on the right.)
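For the warm, key-value side of the comparison above, DynamoDB access boils down to a few calls against a table's primary key. A minimal boto3 sketch, where the table and attribute names are hypothetical placeholders:

```python
import boto3

# The resource API maps items to plain Python dicts.
table = boto3.resource("dynamodb").Table("user_profiles")

# Write an item (at most 400 KB per item, as noted in the comparison above).
table.put_item(Item={
    "user_id": "42",                 # partition key
    "name": "Jane",
    "favorite_listings": ["h-1001", "h-2002"],
})

# Single-digit-millisecond point read by primary key.
response = table.get_item(Key={"user_id": "42"})
print(response.get("Item"))
```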
58. Which Stream/Message Processing Technology Should I Use?
| | Amazon EMR (Spark Streaming) | Apache Storm | KCL Application | Amazon Kinesis Analytics | AWS Lambda | Amazon SQS Application |
| AWS managed | Yes (Amazon EMR) | No (do it yourself) | No (EC2 + Auto Scaling) | Yes | Yes | No (EC2 + Auto Scaling) |
| Serverless | No | No | No | Yes | Yes | No |
| Scale / throughput | No limits / ~nodes | No limits / ~nodes | No limits / ~nodes | Up to 8 KPU / automatic | No limits / automatic | No limits / ~nodes |
| Availability | Single AZ | Configurable | Multi-AZ | Multi-AZ | Multi-AZ | Multi-AZ |
| Programming languages | Java, Python, Scala | Almost any language via Thrift | Java, others via MultiLangDaemon | ANSI SQL with extensions | Node.js, Java, Python | AWS SDK languages (Java, .NET, Python, …) |
| Uses | Multistage processing | Multistage processing | Single-stage processing | Multistage processing | Simple event-based triggers | Simple event-based triggers |
| Reliability | KCL and Spark checkpoints | Framework managed | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda | Managed by SQS visibility timeout |
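The Lambda column above (“simple event-based triggers”) comes down to a handler that is invoked with batches of stream records. A minimal sketch of a Kinesis-triggered function, where the processing step is a hypothetical placeholder:

```python
import base64
import json

def handler(event, context):
    """Invoked by AWS Lambda with a batch of Kinesis records from the mapped stream."""
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Hypothetical processing step: e.g., filter, enrich, or fan out an alert.
        if payload.get("event") == "error":
            print(f"alert: {payload}")
    return {"records_processed": len(event["Records"])}
```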
59. Which Analysis Tool Should I Use?
| | Amazon Redshift | Amazon Athena | Amazon EMR (Presto / Spark / Hive) |
| Use case | Optimized for data warehousing | Ad hoc interactive queries | Presto: interactive query; Spark: general purpose (iterative ML, RT, …); Hive: batch |
| Scale / throughput | ~Nodes | Automatic / no limits | ~Nodes |
| AWS managed service | Yes | Yes, serverless | Yes |
| Storage | Local storage | Amazon S3 | Amazon S3, HDFS |
| Optimization | Columnar storage, data compression, and zone maps | CSV, TSV, JSON, Parquet, ORC, Apache web log | Framework dependent |
| Metadata | Amazon Redshift managed | Athena Catalog Manager | Hive metastore |
| BI tools support | Yes (JDBC/ODBC) | Yes (JDBC) | Yes (JDBC/ODBC & custom) |
| Access controls | Users, groups, and access controls | AWS IAM | Integration with LDAP |
| UDF support | Yes (scalar) | No | Yes |