© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Antoine Généreux, Solutions Architect, AWS Canada
May 24, 2017
Databases and Analytics on the AWS Cloud
What is Big Data?
When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze and share them
How do I absorb multiple sources of data?
How do I store my data?
How do I control access to my data?
How do I keep track of the changes to my data?
How do I know I’m picking the right analysis tool?
How do I know I’m asking the right questions with my data?
How can I make reporting easier and cheaper?
Is there an easier way to do ETL?
How do I move away from my RDBMS?
How do I translate analysis to business answers?
How do I future-proof my architecture?
How do I get started?
How do I keep my costs down?
How do I give access to my different teams?
Big Data Questions and Challenges
The Dark Data Problem
Most generated data is unavailable for analysis
[Chart: data volume by year, 1990 to 2020, comparing generated data with data available for analysis]
Sources:
Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011
IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
“Data is the New Oil” - Brian Krzanich, CEO Intel
Evolution of Analytics
Batch analytics
Real-time analytics
Predictive/Adaptive analytics
A plethora of tools
Amazon Glacier, S3, DynamoDB, RDS, EMR, Amazon Redshift, Data Pipeline, Amazon Kinesis, Amazon Kinesis Streams app, Lambda, Amazon ML, SQS, ElastiCache, DynamoDB Streams, Amazon Elasticsearch Service, Amazon Kinesis Analytics
Simplify Big Data Processing
Time to answer (Latency)
Throughput
Cost
Data → Ingest/Collect → Store → Process/analyze → Consume/visualize → Answers & insights
Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, materialized views (schema-on-read)
Be cost-conscious
• Big data ≠ big cost
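The log-centric principle above can be illustrated with a minimal sketch: an append-only log whose records are never changed, plus a materialized view that is derived from the log and can be rebuilt at any time. The `Event` shape, the `EventLog` class, and the `totals_by_user` view are hypothetical names chosen for illustration; they are not part of any AWS API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One immutable log record (hypothetical shape, for illustration only)."""
    user: str
    action: str
    amount: int


class EventLog:
    """Append-only log: records are added, never updated or deleted."""

    def __init__(self):
        self._records = []

    def append(self, event):
        self._records.append(event)

    def read(self, offset=0):
        # Return a tuple so callers cannot mutate the stored log.
        return tuple(self._records[offset:])


def totals_by_user(log):
    """Materialized view: purchase totals per user, derived purely from the log."""
    view = defaultdict(int)
    for event in log.read():
        if event.action == "purchase":
            view[event.user] += event.amount
    return dict(view)


log = EventLog()
log.append(Event("alice", "purchase", 30))
log.append(Event("bob", "view", 0))
log.append(Event("alice", "purchase", 12))

view = totals_by_user(log)
print(view)
```

Because the view is computed from the log on read, a new question (a different view) needs only a new function over the same records, which is the schema-on-read idea in miniature.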
Building Decoupled Systems
RECORDS
In-memory data structures
Database records
DOCUMENTS Search documents
Log files
Messages
Data streams
Transactions
Files
Events
Types of Data
Ingest/
Collect
Web apps
Mobile apps
Data center apps
Application Logging
Bulk Data Transfer
Messaging
Devices
Sensors
FILES
MESSAGES
STREAMS
Data Temperature
             | Hot data  | Warm data | Cold data
Volume       | MB–GB     | GB–TB     | PB–EB
Item size    | B–KB      | KB–MB     | KB–TB
Latency      | μs–ms     | ms–sec    | min–hrs
Durability   | Low–high  | High      | Very high
Request rate | Very high | High      | Low
Cost/GB      | $$–$      | $–¢¢      | ¢
Ingest/
Collect
Types of Data Stores
Database: SQL & NoSQL databases
Search: Search engines
File/Object store: File systems
Queue: Message queues
Stream storage: Pub/sub message queues
In-memory: Caches, data structure servers
RECORDS
DOCUMENTS
FILES
MESSAGES
STREAMS
Store
Why Stream Storage and Queues?
Store
[Diagram: Producers 1 to n write keyed records (for example, Key = violet) to shard 1 / partition 1 and shard 2 / partition 2 of a DynamoDB stream, Amazon Kinesis stream, or Kafka topic. Consumer 1 reports count of red = 4 and count of violet = 4; Consumer 2 reports count of blue = 4 and count of green = 4.]
• Parallel consumption*
• Preserve client ordering*
• Decouple producers & consumers
• Collect multiple streams
• Persistent buffer
• Streaming MapReduce
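A small sketch may help show how keyed records, shards/partitions, and per-shard consumers fit together. It assumes a simple hash-of-key routing rule with two shards; the function names (`shard_for`, `produce`, `consume`) are hypothetical, and the MD5-modulo rule is only a stand-in for whatever partitioner a given stream store actually uses.

```python
import hashlib
from collections import Counter, defaultdict

NUM_SHARDS = 2  # assumption: two shards/partitions, as in the slide's diagram


def shard_for(key, num_shards=NUM_SHARDS):
    """Route a key to a shard. The same key always lands on the same shard."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def produce(records, num_shards=NUM_SHARDS):
    """Append each (key, sequence) record to its shard, preserving arrival order."""
    shards = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key, num_shards)].append((key, seq))
    return shards


def consume(shard_records):
    """One consumer per shard: count records by key (the 'map' side of a streaming count)."""
    return Counter(key for key, _ in shard_records)


# Four keys, four records each, interleaved the way several producers might emit them.
records = [(color, seq)
           for seq in range(1, 5)
           for color in ("red", "violet", "blue", "green")]

shards = produce(records)

# Each shard can be consumed independently (in parallel); merging is the 'reduce' side.
counts = Counter()
for shard_id in range(NUM_SHARDS):
    counts.update(consume(shards[shard_id]))

print(dict(counts))
```

Because every record for a key goes to one shard and each shard is an ordered list, per-key ordering is preserved while different shards are still free to be read by different consumers.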
Database
Search
File store
Queue
Stream
storage
In-memory
Object Storage
RECORDS
DOCUMENTS
FILES
MESSAGES
STREAMS
Store
Amazon S3
Why is S3 Good for Analytics?
Amazon S3 Core Features
• Unlimited number of objects and volume of data
• Very high bandwidth – no aggregate throughput limit
• Designed for 99.99% availability – can tolerate zone failure
• Designed for 99.999999999% durability
• 3x Data replication included with the service
Store
Why is S3 Good for Analytics?
Decoupling Storage and Compute
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot Instances (Auto-Scaling, Spot Blocks, Spot Fleets)
• Multiple & heterogeneous analysis clusters can use the same data
Store
Amazon S3 Additional Features
• Native support for versioning, object tagging
• Tiered-storage (Standard, IA, Amazon Glacier) via life-cycle policies
• Tiering optimization through S3 Analytics
• Secure – SSL, client/server-side encryption at rest and granular access policies
• Low cost ($0.025/GB in Canada)
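As a rough illustration of tiering via life-cycle policies, the sketch below builds a lifecycle configuration as a plain Python dictionary that moves objects under a prefix to Infrequent Access and then to Glacier. The helper name `tiering_rule`, the `logs/` prefix, and the 30/90-day thresholds are assumptions made for the example; the dictionary follows the shape accepted by the S3 `PutBucketLifecycleConfiguration` operation, but it is only printed here and nothing is sent to AWS.

```python
import json


def tiering_rule(prefix, ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration: Standard -> Standard-IA -> Glacier.

    The day thresholds are illustrative defaults, not recommendations.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = tiering_rule("logs/")
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, a dict of this shape could be
# passed as LifecycleConfiguration to put_bucket_lifecycle_configuration, e.g.
# for a hypothetical bucket named "my-hypothetical-bucket".
```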
Why is S3 Good for Analytics?
Store
Database
Search
File store
Queue
Stream
storage
In-memory
File Storage
RECORDS
DOCUMENTS
FILES
MESSAGES
STREAMS
Store
Choosing the Right Tool
Database Access Anti-Pattern
Use the Right Tool for the Job
Search
Amazon Elasticsearch
Service
In-memory
Amazon ElastiCache
Redis
Memcached
SQL
Amazon Aurora
Amazon RDS
MySQL
PostgreSQL
Oracle
SQL Server
MariaDB
NoSQL
Amazon DynamoDB
Cassandra
HBase
MongoDB
Microsecond Real-Time Performance
Fully Managed
Redis Automatic Failover = NoOps
Enhanced Redis Engine
No Cross-AZ Data Transfer Costs
Easy to Deploy, Use and Monitor
Open-Source Compatible
Amazon
ElastiCache
Fully managed NoSQL
Document / Key-Value store
Single-digit millisecond latency
Massive and seamless scalability
Event-driven programming
DynamoDB Accelerator (DAX)
Amazon
DynamoDB
Preview
Amazon
RDS
Automated backups (with point-in-time recovery)
Cross-region snapshot copies
Automated patch management
Automated Multi-AZ replication
Scale up / Scale down instance types
Scalable storage on demand
“License included” and BYOL models
Amazon Aurora: MySQL and PostgreSQL-compatible
[Diagram: SQL, Transactions, and Caching layers; AZ 1, AZ 2, AZ 3; Amazon S3]
• 5x faster than MySQL on same hardware
• SysBench: 100 K writes/sec and 500 K reads/sec
• Designed for 99.99% availability
• 6-way replicated storage across 3 AZs
• Scale to 64 TB and 15 Read Replicas
How Ticketmaster uses RDS-Aurora
Quick Statistics
• Top 5 ecommerce site
• 26,000 Live Nation Events per year
• 530M Fans in more than 37 countries
• 465M ticket transactions annually
• 1B+ unique visits to web front end
• 400K concert tickets sold in a morning
Account Manager migration
• MySQL 5.6 to Aurora
• 12 MySQL servers to 5 Aurora nodes
• Deployment time from 1-2h to 20 minutes
• Test environment build went from 1 week to 30 minutes
Terraformer
• Infrastructure as code tool for databases
With Aurora:
• Scale In/Out horizontally within 10 minutes, without downtime
• Scale Up/Down vertically with 30 seconds of failover downtime
https://aws.amazon.com/solutions/case-studies/ticketmaster/
• Start your first migration in 10 minutes or less
• Keep your apps running during the migration
• Replicate within, to, or from Amazon EC2 or RDS
• Move data to the same or a different database engine
AWS Database Migration Service
Amazon Elasticsearch Service
Processing and Visualization
Process/
analyze
Analytics Types & Frameworks
Type             | Time to Answer | Example                                      | Services/Frameworks
Batch            | min - hrs      | Reporting, BI; ML Training                   | Amazon EMR, Amazon Redshift
Interactive      | sec            | Data Exploration; Data Forensics             | Amazon EMR, Amazon Athena, Redshift Spectrum
Message          | ms - sec       | Message processing                           | Amazon SQS
Stream           | ms - sec       | Alerts & Notifications; Real-Time Dashboards | Amazon EMR, Amazon Kinesis, AWS Lambda, Apache Storm
Machine Learning | ms - min       | Demand Forecast; Recommendations             | Amazon ML, Amazon EMR
Relational Columnar MPP data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
Amazon
Redshift
Process/
analyze
Batch Analytics – Amazon Redshift
Amazon Redshift architecture
Leader node
Simple SQL endpoint
Stores metadata
Optimizes query plan
Coordinates query execution
Compute nodes
Local columnar storage
Parallel/distributed execution of all queries, loads, backups, restores, resizes
Start at just $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS2: HDD; scale from 2 TB to 2 PB
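The leader/compute split described above can be sketched as a scatter-gather aggregation: rows are spread across compute nodes, each node keeps its slice in columnar form and computes a partial result, and the leader combines the partials. This is a toy model under stated assumptions (round-robin distribution, threads standing in for nodes, hypothetical function names), not a description of how Redshift is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor


def to_columns(rows):
    """Store a slice column by column, so a query touches only the columns it needs."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def distribute(rows, num_nodes):
    """Round-robin rows across compute nodes (an assumed distribution style)."""
    return [to_columns(rows[i::num_nodes]) for i in range(num_nodes)]


def node_partial_sum(columns, column):
    """Work done on one compute node: partial sum and row count for a column."""
    values = columns.get(column, [])
    return sum(values), len(values)


def leader_average(nodes, column):
    """Leader role: fan the query out to every node, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda cols: node_partial_sum(cols, column), nodes))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count if count else None


rows = [
    {"region": "east", "sales": 10},
    {"region": "west", "sales": 20},
    {"region": "east", "sales": 30},
    {"region": "west", "sales": 40},
]
nodes = distribute(rows, num_nodes=3)
avg = leader_average(nodes, "sales")
print(avg)
```

The average is computed from per-node sums and counts rather than per-node averages, which keeps the merged result exact even when nodes hold different numbers of rows.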
Ingestion/backup
Backup
Restore
JDBC/ODBC
10 GigE
(HPC)
Process/
analyze
Why migrate to Amazon Redshift?
Transactional database: 100x faster; scales from GBs to PBs; analyze data without storage constraints
MPP database: 10x cheaper; easy to provision and operate; higher productivity
Hadoop: 10x faster; no programming; standard interfaces and integration to leverage BI tools, machine learning, streaming
Process/
analyze
Migration from Oracle @ Boingo Wireless
2000+ Commercial Wi-Fi locations
1 million+ Hotspots
90M+ ad engagements
100+ countries
Legacy DW: Oracle 11g based DW
Before migration:
• Rapid data growth slowed analytics
• Low IOPS, limited memory, vertical scaling
• Admin overhead
• Expensive (license, h/w, support)
After migration:
• 180x performance improvement
• 7x cost savings
Process/
analyze
https://aws.amazon.com/solutions/case-studies/boingo-wireless/
Migration from Oracle @ Boingo Wireless
[Chart: cost comparison, axis 0 to 400,000: Exadata $400,000; SAP HANA $300,000; Redshift $55,000. 7X cheaper than Oracle Exadata]
[Chart: latency in seconds, existing system vs. Redshift, for query performance (1 year of data) and data load performance (1 million records); plotted values 7,200, 2,700, 15, 15. 180X faster than Oracle database]
Process/
analyze
Interactive Analytics
Amazon Athena
• Managed query service for Amazon S3 data
• Zero spin-up time
• Presto and Hive based
• No data loading or ETL required
• CSV, Web Log, TSV, JSON, Parquet, ORC, AVRO support
• JDBC Driver
Redshift Spectrum
• Exabyte-scale queries on S3 directly from Redshift
• High concurrency, elastic and highly available
• No data loading or ETL required
• Full Redshift SQL support
• Redshift optimizations
• JDBC/ODBC support
Process/
analyze
Real-Time Analytics with MLB Statcast
• Ball position sampled 2000 times/second
• Player position sampled 30 times/second
• 12 second time-to-answer
• 7 TB of data generated per game
Process/
analyze
https://aws.amazon.com/solutions/case-studies/major-league-baseball-mlbam/
Recommendations & Ranking at Netflix
Personalized ranking, page generation, search, similarity, ratings
2016: Launched in 140 new countries, simultaneously
Process/analyze
What About ETL?
https://aws.amazon.com/big-data/partner-solutions/
Data Integration Partners
Reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes.
AWS Glue
AWS Glue is a fully managed ETL service that makes
it easy to understand your data sources, prepare the
data, and move it reliably between data stores
Preview
Process/
analyze
Data Consumption & Visualization
Store → Process/analyze (ETL) → Consume/visualize
• Applications & API: Apps & Services (API, IDE)
• Analysis and visualization, Notebooks, IDE: data scientists, developers
• Amazon QuickSight: business users
Amazon QuickSight
[Diagram: Business users reach the QuickSight UI from mobile devices and web browsers; a QuickSight API is also shown. Components: Data prep, Metadata, Suggestions, Connectors, SPICE. Data sources: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS, Amazon Athena, files, apps, and on-premises data (Direct Connect, JDBC/ODBC).]
Consume/visualize
Partner BI Products
Putting it all Together
Real-Time Analytics and ML - DataXu
Who
• Spun out of MIT Labs
• A petabyte scale digital marketing platform
• One of the fastest growing companies in Inc. 5000
What
• Help the world’s most valuable brands understand and engage with their consumers
• Maximize ROI
Quick Statistics
• 2M+ bid requests per second
• Billions of impressions, petabytes of data
• 180+ TB logs per day
• 2 PB data analyzed daily
• 3000+ servers powering the platform
• 13 regions, 24x7
https://aws.amazon.com/solutions/case-studies/dataxu/
Real-Time Ad Bidding Cycle
~10ms
DataXu Data Flow (Batch Layer)
[Diagram components: CDN, Real Time Bidding, Retargeting Platform, Amazon Kinesis Streams, S3 (All Data), ETL, Attribution, Amazon Machine Learning, Amazon Athena, Advanced Analytics (Third Party), Reporting Tools (Third Party), Ecosystem of tools and services]
[Diagram: Data → Ingest/Collect → Store → Process/analyze → Consume/visualize → Answers & insights. Services: Amazon Kinesis, Amazon S3 Data Lake, Amazon EMR, Amazon Redshift, Amazon DynamoDB. Data: Users, Properties, Agents, Real Time Data, … Outputs: User Profile, Recommendation, Hot Homes, Similar Homes, Agent Follow-up, Agent Scorecard, Marketing, A/B Testing, BI / Reporting.]
https://aws.amazon.com/solutions/case-studies/redfin/
[Diagram components: Users; Data Providers; Data Ingestion Services (Auto Scaling EC2); Analytics Apps (Auto Scaling EC2); Data Marts (Amazon Redshift); Query Clusters (EMR); Ad Hoc Query Cluster (EMR); Batch Analytic Clusters (EMR); Normalization ETL Clusters (EMR); Optimization ETL Clusters (EMR); Source of Truth (S3); Query Optimized (S3); Shared Metastore (RDS); Reference Data (RDS); Shared Data Services; Data Catalog & Lineage Services (Auto Scaling EC2); Cluster Mgt & Workflow Services (Auto Scaling EC2)]
5+ PB Data Lake
Up to 75 billion market events analyzed per day
https://aws.amazon.com/solutions/case-studies/finra/
Data Lake Design
https://aws.amazon.com/answers/big-data/data-lake-solution/
Recap
Architectural Principles
Build decoupled systems
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable logs, materialized views (schema-on-read)
Be cost-conscious
• Big data ≠ big cost
[Diagram: reference architecture, COLLECT → STORE → PROCESS / ANALYZE → CONSUME]
COLLECT (Applications, Transport, Logging, IoT, Messaging): Mobile apps, Web apps, Data centers, AWS Direct Connect, AWS Import/Export Snowball, Amazon CloudWatch, AWS CloudTrail, Devices, Sensors & IoT platforms, AWS IoT, Messaging
Data types: RECORDS, DOCUMENTS, FILES, MESSAGES, STREAMS
STORE (Search, SQL, NoSQL, Cache, File, Message, Stream; labeled Hot / Warm): Amazon Elasticsearch Service, Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon SQS, Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
PROCESS / ANALYZE (Batch, Message, Interactive, Stream, ML; labeled Fast / Slow): Amazon EMR, Presto, Amazon Redshift, Amazon Athena, Amazon SQS apps, Streaming KCL apps, AWS Lambda, Amazon Kinesis Analytics, Amazon Machine Learning, Amazon EC2, ETL
CONSUME: Amazon QuickSight, Apps & Services, Analysis & visualization, Notebooks, IDE, API
Thank you!
Antoine Généreux, Solutions Architect, AWS Canada
Data Characteristics: Hot, Warm, Cold
             | Hot data  | Warm data | Cold data
Volume       | MB–GB     | GB–TB     | PB–EB
Item size    | B–KB      | KB–MB     | KB–TB
Latency      | ms        | ms, sec   | min, hrs
Durability   | Low–high  | High      | Very high
Request rate | Very high | High      | Low
Cost/GB      | $$–$      | $–¢¢      | ¢
Which Stream/Message Storage Should I Use?
                      | Amazon DynamoDB Streams   | Amazon Kinesis Streams | Amazon Kinesis Firehose     | Apache Kafka        | Amazon SQS (Standard)   | Amazon SQS (FIFO)
AWS managed           | Yes                       | Yes                    | Yes                         | No                  | Yes                     | Yes
Guaranteed ordering   | Yes                       | Yes                    | No                          | Yes                 | No                      | Yes
Delivery (deduping)   | Exactly-once              | At-least-once          | At-least-once               | At-least-once       | At-least-once           | Exactly-once
Data retention period | 24 hours                  | 7 days                 | N/A                         | Configurable        | 14 days                 | 14 days
Availability          | 3 AZ                      | 3 AZ                   | 3 AZ                        | Configurable        | 3 AZ                    | 3 AZ
Scale / throughput    | No limit / ~ table IOPS   | No limit / ~ shards    | No limit / automatic        | No limit / ~ nodes  | No limits / automatic   | 300 TPS / queue
Parallel consumption  | Yes                       | Yes                    | No                          | Yes                 | No                      | No
Stream MapReduce      | Yes                       | Yes                    | N/A                         | Yes                 | N/A                     | N/A
Row/object size       | 400 KB                    | 1 MB                   | Destination row/object size | Configurable        | 256 KB                  | 256 KB
Cost                  | Higher (table cost)       | Low                    | Low                         | Low (+admin)        | Low-medium              | Low-medium
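The "Delivery (deduping)" row matters in practice: with at-least-once delivery the same message can arrive more than once, so consumers are usually written to be idempotent. The sketch below shows one common approach, tracking already-processed message IDs. The class name, the message IDs, and the in-memory `set` are assumptions for illustration; a real consumer would typically keep this state in a durable store.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered several times."""

    def __init__(self):
        self.seen_ids = set()  # assumption: in-memory; durable storage in practice
        self.total = 0

    def handle(self, message_id, amount):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.total += amount
        return True


# Simulated at-least-once delivery: m1 and m2 are each redelivered once.
deliveries = [("m1", 5), ("m2", 7), ("m1", 5), ("m3", 1), ("m2", 7)]

consumer = IdempotentConsumer()
applied = [consumer.handle(mid, amt) for mid, amt in deliveries]

print(applied)
print(consumer.total)
```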
Which Data Store Should I Use?
(Ordered from hot data to cold data)
                      | Amazon ElastiCache | Amazon DynamoDB      | Amazon RDS/Aurora  | Amazon ES        | Amazon S3              | Amazon Glacier
Average latency       | ms                 | ms                   | ms, sec            | ms, sec          | ms, sec, min (~ size)  | hrs
Typical data stored   | GB                 | GB–TBs (no limit)    | GB–TB (64 TB max)  | GB–TB            | MB–PB (no limit)       | GB–PB (no limit)
Typical item size     | B-KB               | KB (400 KB max)      | KB (64 KB max)     | B-KB (2 GB max)  | KB-TB (5 TB max)       | GB (40 TB max)
Request rate          | High – very high   | Very high (no limit) | High               | High             | Low – high (no limit)  | Very low
Storage cost GB/month | $$                 | ¢¢                   | ¢¢                 | ¢¢               | ¢                      | ¢4/10
Durability            | Low - moderate     | Very high            | Very high          | High             | Very high              | Very high
Availability          | High, 2 AZ         | Very high, 3 AZ      | Very high, 3 AZ    | High, 2 AZ       | Very high, 3 AZ        | Very high, 3 AZ
Which Stream/Message Processing Technology Should I Use?
                      | Amazon EMR (Spark Streaming) | Apache Storm                   | KCL Application                  | Amazon Kinesis Analytics            | AWS Lambda                  | Amazon SQS Application
AWS managed           | Yes (Amazon EMR)             | No (Do it yourself)            | No (EC2 + Auto Scaling)          | Yes                                 | Yes                         | No (EC2 + Auto Scaling)
Serverless            | No                           | No                             | No                               | Yes                                 | Yes                         | No
Scale / throughput    | No limits / ~ nodes          | No limits / ~ nodes            | No limits / ~ nodes              | Up to 8 KPU / automatic             | No limits / automatic       | No limits / ~ nodes
Availability          | Single AZ                    | Configurable                   | Multi-AZ                         | Multi-AZ                            | Multi-AZ                    | Multi-AZ
Programming languages | Java, Python, Scala          | Almost any language via Thrift | Java, others via MultiLangDaemon | ANSI SQL with extensions            | Node.js, Java, Python       | AWS SDK languages (Java, .NET, Python, …)
Uses                  | Multistage processing        | Multistage processing          | Single stage processing          | Multistage processing               | Simple event-based triggers | Simple event-based triggers
Reliability           | KCL and Spark checkpoints    | Framework managed              | Managed by KCL                   | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda       | Managed by SQS Visibility Timeout
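The SQS visibility timeout mentioned in the Reliability row can be modeled in a few lines: a received message is hidden from other consumers for a fixed period, and it reappears if it is not deleted in time (for example, because the worker crashed). The `VisibilityQueue` class below is a hypothetical in-memory model with an explicit clock argument, intended only to show the idea; it is not the SQS API.

```python
class VisibilityQueue:
    """Toy queue with a visibility timeout; time is passed in explicitly as `now`."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}         # message id -> body
        self._invisible_until = {}  # message id -> time it becomes visible again
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self._messages[self._next_id] = body
        self._invisible_until[self._next_id] = 0.0
        return self._next_id

    def receive(self, now):
        """Return the first visible message and hide it for the timeout period."""
        for message_id, body in self._messages.items():
            if self._invisible_until[message_id] <= now:
                self._invisible_until[message_id] = now + self.visibility_timeout
                return message_id, body
        return None

    def delete(self, message_id):
        """Acknowledge successful processing so the message is never redelivered."""
        self._messages.pop(message_id, None)
        self._invisible_until.pop(message_id, None)


queue = VisibilityQueue(visibility_timeout=30)
queue.send("resize image 17")

first = queue.receive(now=0)    # worker A takes the message, then crashes
hidden = queue.receive(now=10)  # still inside the timeout: nothing is visible
retry = queue.receive(now=31)   # timeout elapsed: the message is delivered again
queue.delete(retry[0])          # worker B finishes and deletes it
empty = queue.receive(now=100)

print(first, hidden, retry, empty)
```

This redelivery behavior is also why at-least-once consumers need to tolerate duplicates: a slow worker and its replacement may both end up processing the same message.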
Which Analysis Tool Should I Use?
                    | Amazon Redshift                                      | Amazon Athena                                     | Amazon EMR (Presto, Spark, Hive)
Use case            | Optimized for data warehousing                       | Ad-hoc Interactive Queries                        | Presto: Interactive Query; Spark: General purpose (iterative ML, RT, ..); Hive: Batch
Scale/throughput    | ~ Nodes                                              | Automatic / No limits                             | ~ Nodes
AWS Managed Service | Yes                                                  | Yes, Serverless                                   | Yes
Storage             | Local storage                                        | Amazon S3                                         | Amazon S3, HDFS
Optimization        | Columnar storage, data compression, and zone maps    | CSV, TSV, JSON, Parquet, ORC, Apache Web log      | Framework dependent
Metadata            | Amazon Redshift managed                              | Athena Catalog Manager                            | Hive Meta-store
BI tools support    | Yes (JDBC/ODBC)                                      | Yes (JDBC)                                        | Yes (JDBC/ODBC & Custom)
Access controls     | Users, groups, and access controls                   | AWS IAM                                           | Integration with LDAP
UDF support         | Yes (Scalar)                                         | No                                                | Yes
Slow

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Full Stack Analytics on AWS - AWS Summit Cape Town 2017
Full Stack Analytics on AWS - AWS Summit Cape Town 2017 Full Stack Analytics on AWS - AWS Summit Cape Town 2017
Full Stack Analytics on AWS - AWS Summit Cape Town 2017
 
Building a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWSBuilding a Data Processing Pipeline on AWS
Building a Data Processing Pipeline on AWS
 
Big data on aws
Big data on awsBig data on aws
Big data on aws
 
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
AWS re:Invent 2016: How to Build a Big Data Analytics Data Lake (LFS303)
 
Tapping the cloud for real time data analytics
 Tapping the cloud for real time data analytics Tapping the cloud for real time data analytics
Tapping the cloud for real time data analytics
 
AWS Big Data Solution Days
AWS Big Data Solution DaysAWS Big Data Solution Days
AWS Big Data Solution Days
 
ENT314 Automate Best Practices and Operational Health for Your AWS Resources
ENT314 Automate Best Practices and Operational Health for Your AWS ResourcesENT314 Automate Best Practices and Operational Health for Your AWS Resources
ENT314 Automate Best Practices and Operational Health for Your AWS Resources
 
Are you Well-Architected? - AWS Online Tech Talks
Are you Well-Architected? - AWS Online Tech TalksAre you Well-Architected? - AWS Online Tech Talks
Are you Well-Architected? - AWS Online Tech Talks
 
AWS Reinvent Recap 2018
AWS Reinvent Recap 2018 AWS Reinvent Recap 2018
AWS Reinvent Recap 2018
 
Building your First Big Data Application on AWS
Building your First Big Data Application on AWSBuilding your First Big Data Application on AWS
Building your First Big Data Application on AWS
 
Welcome Keynote - AWS Summit Stockholm
Welcome Keynote - AWS Summit Stockholm Welcome Keynote - AWS Summit Stockholm
Welcome Keynote - AWS Summit Stockholm
 
Structured, Unstructured and Streaming Big Data on the AWS
Structured, Unstructured and Streaming Big Data on the AWSStructured, Unstructured and Streaming Big Data on the AWS
Structured, Unstructured and Streaming Big Data on the AWS
 
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)
AWS re:Invent 2016: What’s New with Amazon Redshift (BDA304)
 
Scaling Ideas: Accelerating Research with AWS - Technical 301
Scaling Ideas: Accelerating Research with AWS - Technical 301Scaling Ideas: Accelerating Research with AWS - Technical 301
Scaling Ideas: Accelerating Research with AWS - Technical 301
 
Creating a Data Driven Culture with Amazon QuickSight - Technical 201
Creating a Data Driven Culture with Amazon QuickSight - Technical 201Creating a Data Driven Culture with Amazon QuickSight - Technical 201
Creating a Data Driven Culture with Amazon QuickSight - Technical 201
 
NEW LAUNCH! Introducing AWS Batch: Easy and efficient batch computing
 	  NEW LAUNCH! Introducing AWS Batch: Easy and efficient batch computing 	  NEW LAUNCH! Introducing AWS Batch: Easy and efficient batch computing
NEW LAUNCH! Introducing AWS Batch: Easy and efficient batch computing
 
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
Best Practices Using Big Data on AWS | AWS Public Sector Summit 2017
 
The Power of Big Data - AWS Summit Bahrain 2017
The Power of Big Data - AWS Summit Bahrain 2017The Power of Big Data - AWS Summit Bahrain 2017
The Power of Big Data - AWS Summit Bahrain 2017
 
AWS re:Invent 2016: High Performance Cinematic Production in the Cloud (MAE304)
AWS re:Invent 2016: High Performance Cinematic Production in the Cloud (MAE304)AWS re:Invent 2016: High Performance Cinematic Production in the Cloud (MAE304)
AWS re:Invent 2016: High Performance Cinematic Production in the Cloud (MAE304)
 
Your First Data Lake on AWS_Simon Elisha
Your First Data Lake on AWS_Simon ElishaYour First Data Lake on AWS_Simon Elisha
Your First Data Lake on AWS_Simon Elisha
 

Semelhante a Database and Analytics on the AWS Cloud

Semelhante a Database and Analytics on the AWS Cloud (20)

AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
AWS re:Invent 2016: Big Data Architectural Patterns and Best Practices on AWS...
 
Deep Dive in Big Data
Deep Dive in Big DataDeep Dive in Big Data
Deep Dive in Big Data
 
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel AvivBig Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
Big Data and Architectural Patterns on AWS - Pop-up Loft Tel Aviv
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
Big Data adoption success using AWS Big Data Services - Pop-up Loft TLV 2017
 
Fast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWSFast Track to Your Data Lake on AWS
Fast Track to Your Data Lake on AWS
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
AWS Enterprise Summit Netherlands - Big Data Architectural Patterns & Best Pr...
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS(BDT310) Big Data Architectural Patterns and Best Practices on AWS
(BDT310) Big Data Architectural Patterns and Best Practices on AWS
 
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
AWS November Webinar Series - Architectural Patterns & Best Practices for Big...
 
Welcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution OverviewWelcome & AWS Big Data Solution Overview
Welcome & AWS Big Data Solution Overview
 
February 2016 Webinar Series - Architectural Patterns for Big Data on AWS
February 2016 Webinar Series - Architectural Patterns for Big Data on AWSFebruary 2016 Webinar Series - Architectural Patterns for Big Data on AWS
February 2016 Webinar Series - Architectural Patterns for Big Data on AWS
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Big Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best PracticesBig Data Architectural Patterns and Best Practices
Big Data Architectural Patterns and Best Practices
 

Mais de Amazon Web Services

Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
Amazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
Amazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
Amazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
Amazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads

Database and Analytics on the AWS Cloud

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. Databases and Analytics on the AWS Cloud. Antoine Généreux, Solutions Architect, AWS Canada. May 24, 2017.
  • 3. What is Big Data? When your data sets become so large and diverse that you have to start innovating around how to collect, store, process, analyze and share them
  • 4. Big Data Questions and Challenges
      • How do I absorb multiple sources of data?
      • How do I store my data?
      • How do I control access to my data?
      • How do I keep track of the changes to my data?
      • How do I know I’m picking the right analysis tool?
      • How do I know I’m asking the right questions with my data?
      • How can I make reporting easier and cheaper?
      • Is there an easier way to do ETL?
      • How do I move away from my RDBMS?
      • How do I translate analysis to business answers?
      • How do I future-proof my architecture?
      • How do I get started?
      • How do I keep my costs down?
      • How do I give access to my different teams?
  • 5. The Dark Data Problem: most generated data is unavailable for analysis. [Chart: data volume by year, 1990 to 2020, comparing generated data with data available for analysis.] Sources: Gartner: User Survey Analysis: Key Trends Shaping the Future of Data Center Infrastructure Through 2011; IDC: Worldwide Business Analytics Software 2012–2016 Forecast and 2011 Vendor Shares
  • 6. “Data is the New Oil” - Brian Krzanich, CEO Intel
  • 7. Evolution of Analytics Batch analytics Real-time analytics Predictive/Adaptive analytics
  • 8. A plethora of tools: Amazon Glacier, S3, DynamoDB, RDS, EMR, Amazon Redshift, Data Pipeline, Amazon Kinesis, Amazon Kinesis Streams app, Lambda, Amazon ML, SQS, ElastiCache, DynamoDB Streams, Amazon Elasticsearch Service, Amazon Kinesis Analytics
  • 9. Simplify Big Data Processing: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & insights. Considerations: time to answer (latency), throughput, cost
  • 10. Architectural Principles
      • Build decoupled systems: Data → Store → Process → Store → Analyze → Answers
      • Use the right tool for the job: data structure, latency, throughput, access patterns
      • Leverage AWS managed services: scalable/elastic, available, reliable, secure, no/low admin
      • Use log-centric design patterns: immutable logs, materialized views (schema-on-read)
      • Be cost-conscious: big data ≠ big cost
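The log-centric principle on slide 10 can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `EventLog` class and made-up event fields (`type`, `user`), none of which come from the deck: events are only ever appended, and a view is derived by replaying them, so the schema is applied at read time.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class EventLog:
    """Hypothetical append-only log: records are never updated or deleted."""

    def __init__(self) -> None:
        self._events: List[dict] = []

    def append(self, event: dict) -> None:
        # Store a copy so callers cannot mutate history after the fact.
        self._events.append(dict(event))

    def read(self) -> Tuple[dict, ...]:
        # Hand out copies so readers cannot alter the stored events.
        return tuple(dict(e) for e in self._events)


def page_views_by_user(log: EventLog) -> Dict[str, int]:
    """Materialized view rebuilt by replaying the log (schema applied on read)."""
    view: Dict[str, int] = defaultdict(int)
    for event in log.read():
        if event.get("type") == "page_view":
            view[event["user"]] += 1
    return dict(view)


log = EventLog()
log.append({"type": "page_view", "user": "alice"})
log.append({"type": "purchase", "user": "alice", "amount": 12})
log.append({"type": "page_view", "user": "bob"})
log.append({"type": "page_view", "user": "alice"})
views = page_views_by_user(log)
print(views)
```

Because the log is never rewritten, a new view (for example, purchases per user) can be added later by replaying the same events with a different function.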
  • 12. Types of Data (Ingest/Collect)
      • Sources: web apps, mobile apps, data center apps (Application); Logging; Bulk Data Transfer; Messaging; devices, sensors
      • Data items: in-memory data structures, database records, search documents, log files, messages, data streams, transactions, files, events
      • Categories: RECORDS, DOCUMENTS, FILES, MESSAGES, STREAMS
  • 13. Data Temperature (Ingest/Collect)
      Columns: Hot | Warm | Cold
      Volume: MB–GB | GB–TB | PB–EB
      Item size: B–KB | KB–MB | KB–TB
      Latency: μs-ms | ms-sec | min-hrs
      Durability: Low–high | High | Very high
      Request rate: Very high | High | Low
      Cost/GB: $$-$ | $-¢¢ | ¢
  • 14. Types of Data Stores (Store)
      • Database: SQL & NoSQL databases
      • Search: search engines
      • File/Object store: file systems
      • Queue: message queues
      • Stream storage: pub/sub message queues
      • In-memory: caches, data structure servers
      • Data categories: RECORDS, DOCUMENTS, FILES, MESSAGES, STREAMS
  • 15. Why Stream Storage and Queues? (Store)
      • Decouple producers & consumers
      • Persistent buffer
      • Collect multiple streams
      • Preserve client ordering*
      • Parallel consumption*
      • Streaming MapReduce
      • [Diagram: producers 1 to n write keyed records (for example, key = violet) to shard 1 / partition 1 and shard 2 / partition 2 of a DynamoDB stream, Amazon Kinesis stream, or Kafka topic. Consumer 1 reports count of red = 4 and count of violet = 4; consumer 2 reports count of blue = 4 and count of green = 4.]
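The ordering and parallel-consumption points on slide 15 can be sketched as follows. This is a simplified illustration, not how any particular service is implemented: it assumes two shards and maps each key to a shard with a hash modulo the shard count (the helper names `shard_for`, `produce`, and `consume` are hypothetical). Records with the same key always land in the same shard, so their order is preserved, and each shard can be counted by its own consumer.

```python
import hashlib
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

NUM_SHARDS = 2  # assumption: two shards, as in the slide's diagram


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard with a stable hash (simplified modulo scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def produce(records: List[Tuple[str, int]]) -> Dict[int, List[Tuple[str, int]]]:
    """Append each (key, sequence) record to the shard chosen by its key."""
    shards: Dict[int, List[Tuple[str, int]]] = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key)].append((key, seq))
    return shards


def consume(shard: List[Tuple[str, int]]) -> Counter:
    """One consumer per shard counts records per key, independent of other shards."""
    return Counter(key for key, _ in shard)


# Four records per color, interleaved the way several producers might emit them.
colors = ("red", "violet", "blue", "green")
records = [(color, seq) for seq in range(1, 5) for color in colors]
shards = produce(records)
counts = {shard_id: consume(shard) for shard_id, shard in shards.items()}
print(counts)
```

Which colors share a shard depends on the hash, so the split may differ from the one drawn on the slide; the per-key totals and per-key ordering do not.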
  • 17. Why is S3 Good for Analytics? Amazon S3 Core Features • Unlimited number of objects and volume of data • Very high bandwidth – no aggregate throughput limit • Designed for 99.99% availability – can tolerate zone failure • Designed for 99.999999999% durability • 3x Data replication included with the service Store
  • 18. Why is S3 Good for Analytics? Decoupling Storage and Compute • Natively supported by big data frameworks (Spark, Hive, Presto, etc.) • No need to run compute clusters for storage (unlike HDFS) • Can run transient Hadoop clusters & Amazon EC2 Spot Instances (Auto-Scaling, Spot Blocks, Spot Fleets) • Multiple & heterogeneous analysis clusters can use the same data Store
  • 19. Why is S3 Good for Analytics? Amazon S3 Additional Features (Store)
      • Native support for versioning, object tagging
      • Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policies
      • Tiering optimization through S3 Analytics
      • Secure: SSL, client/server-side encryption at rest, and granular access policies
      • Low cost ($0.025/GB in Canada)
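The tiered-storage bullet on slide 19 can be made concrete with a small sketch that builds a lifecycle configuration as a plain dictionary. The prefix, day thresholds, and the `tiering_rule` helper are illustrative assumptions, not values from the deck; the dictionary follows the shape of an S3 lifecycle configuration, and nothing here calls AWS.

```python
import json
from typing import Any, Dict


def tiering_rule(prefix: str, ia_after_days: int, glacier_after_days: int) -> Dict[str, Any]:
    """Build a lifecycle configuration that moves objects to colder tiers over time."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Standard -> Infrequent Access -> Glacier
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Hypothetical policy: logs move to IA after 30 days and to Glacier after 90.
config = tiering_rule("logs/", ia_after_days=30, glacier_after_days=90)
print(json.dumps(config, indent=2))
```

A dictionary like this could be passed as the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`, given a bucket name and credentials, which are outside the scope of this sketch.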
  • 23. Use the Right Tool for the Job
      • Search: Amazon Elasticsearch Service
      • In-memory: Amazon ElastiCache (Redis, Memcached)
      • SQL: Amazon Aurora; Amazon RDS (MySQL, PostgreSQL, Oracle, SQL Server, MariaDB)
      • NoSQL: Amazon DynamoDB, Cassandra, HBase, MongoDB
  • 24. Amazon ElastiCache
      • Microsecond real-time performance
      • Fully managed Redis
      • Automatic failover = NoOps
      • Enhanced Redis engine
      • No cross-AZ data transfer costs
      • Easy to deploy, use and monitor
      • Open-source compatible
  • 25. Amazon DynamoDB
      • Fully managed NoSQL
      • Document / key-value store
      • Single-digit millisecond latency
      • Massive and seamless scalability
      • Event-driven programming
      • DynamoDB Accelerator (DAX) (Preview)
  • 26. Amazon RDS Automated backups (with point-in-time recovery) Cross-region snapshot copies Automated patch management Automated Multi-AZ replication Scale up / Scale down instance types Scalable storage on demand “License included” and BYOL models
  • 27. Amazon Aurora: MySQL and PostgreSQL-compatible
      • 5x faster than MySQL on same hardware
      • SysBench: 100 K writes/sec and 500 K reads/sec
      • Designed for 99.99% availability
      • 6-way replicated storage across 3 AZs
      • Scale to 64 TB and 15 Read Replicas
      • [Diagram: SQL, transactions, and caching layers over storage spanning AZ 1, AZ 2, and AZ 3, with Amazon S3.]
  • 28. How Ticketmaster uses RDS-Aurora Quick Statistics • Top 5 ecommerce site • 26,000 Live Nation Events per year • 530M Fans in more than 37 countries • 465M ticket transactions annually • 1B+ unique visits to web front end • 400K concert tickets sold in a morning Account Manager migration • MySQL 5.6 to Aurora • 12 MySQL servers to 5 Aurora nodes • Deployment time from 1-2h to 20 minutes • Test environment build went from 1 week to 30 minutes Terraformer • Infrastructure as code tool for databases With Aurora: • Scale In/Out horizontally within 10 minutes, without downtime • Scale Up/Down vertically with 30 seconds of failover downtime https://aws.amazon.com/solutions/case-studies/ticketmaster/
  • 29. • Start your first migration in 10 minutes or less • Keep your apps running during the migration • Replicate within, to, or from Amazon EC2 or RDS • Move data to the same or a different database engine AWS Database Migration Service
  • 32. Analytics Types & Frameworks (Process/analyze)
      Columns: Type | Time to Answer | Example | Services/Frameworks
      Batch | min - hrs | Reporting, BI; ML Training | Amazon EMR, Amazon Redshift
      Interactive | sec | Data Exploration; Data Forensics | Amazon EMR, Amazon Athena, Redshift Spectrum
      Message | ms - sec | Message processing | Amazon SQS
      Stream | ms - sec | Alerts & Notifications; Real-Time Dashboards | Amazon EMR, Amazon Kinesis, AWS Lambda, Apache Storm
      Machine Learning | ms - min | Demand Forecast; Recommendations | Amazon ML, Amazon EMR
  • 33. Batch Analytics – Amazon Redshift (Process/analyze)
      • Relational columnar MPP data warehouse
      • Massively parallel; petabyte scale
      • Fully managed
      • HDD and SSD platforms
      • $1,000/TB/year; starts at $0.25/hour
  • 34. Amazon Redshift architecture Leader node Simple SQL endpoint Stores metadata Optimizes query plan Coordinates query execution Compute nodes Local columnar storage Parallel/distributed execution of all queries, loads, backups, restores, resizes Start at just $0.25/hour, grow to 2 PB (compressed) DC1: SSD; scale from 160 GB to 326 TB DS2: HDD; scale from 2 TB to 2 PB Ingestion/backup Backup Restore JDBC/ODBC 10 GigE (HPC) Process/ analyze
  • 35. Why migrate to Amazon Redshift? 100x faster Scales from GBs to PBs Analyze data without storage constraints 10x cheaper Easy to provision and operate Higher productivity 10x faster No programming Standard interfaces and integration to leverage BI tools, machine learning, streaming Transactional database MPP database Hadoop Process/ analyze
  • 36. Migration from Oracle @ Boingo Wireless 2000+ Commercial Wi-Fi locations 1 million+ Hotspots 90M+ ad engagements 100+ countries Legacy DW: Oracle 11g based DW Before migration: Rapid data growth slowed analytics Low IOPS, limited memory, vertical scaling Admin overhead Expensive (license, h/w, support) After migration: 180x performance improvement 7x cost savings Process/ analyze https://aws.amazon.com/solutions/case-studies/boingo-wireless/
  • 37. Migration from Oracle @ Boingo Wireless (Process/analyze)
      • 7X cheaper than Oracle Exadata; 180X faster than Oracle database
      • [Cost chart, axis 0 to 400,000: Exadata $400,000; SAP HANA $300,000; Redshift $55,000]
      • [Latency chart, in seconds, existing system vs. Redshift: query performance (1 year of data) and data load performance (1 million records); values shown: 7,200; 2,700; 15; 15]
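As a quick arithmetic check of the headline figures on slide 37, the sketch below recomputes the two ratios. It assumes that the "7X" claim compares the $400,000 Exadata and $55,000 Redshift values, and that the "180X" claim compares the 2,700-second and 15-second latency values; the flattened chart does not say which measurement each latency belongs to, so that pairing is a guess.

```python
# Cost values read from the slide's chart labels.
exadata_cost = 400_000
redshift_cost = 55_000
cost_ratio = exadata_cost / redshift_cost

# Latency values in seconds; pairing 2,700 with 15 is an assumption.
existing_latency_s = 2_700
redshift_latency_s = 15
speedup = existing_latency_s / redshift_latency_s

print(f"cost ratio: {cost_ratio:.1f}x, speedup: {speedup:.0f}x")
# prints "cost ratio: 7.3x, speedup: 180x"
```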
  • 38. Interactive Analytics Amazon Athena • Managed query service for Amazon S3 data • Zero Spin up time • Presto and Hive based • No data load, ETL • CSV, Web Log, TSV, JSON, Parquet, ORC, AVRO support • JDBC Driver Redshift Spectrum • Exabyte-scale queries on S3 directly from Redshift • High concurrency, elastic and highly available • No data load, ETL • Full Redshift SQL support • Redshift optimizations • JDBC/ODBC support Process/ analyze
  • 39. Real-Time Analytics with MLB Statcast • Ball position sampled 2000 times/second • Player position sampled 30 times/second • 12 second time-to- answer • 7TB of data generated per game Process/ analyze https://aws.amazon.com/solutions/case-studies/major-league-baseball-mlbam/
  • 40. Recommendations & Ranking at Netflix (Process/analyze): personalized ranking, page generation, search, similarity, ratings. 2016: launched in 140 new countries simultaneously
  • 41. What About ETL? (Store, Process/analyze)
      • Data Integration Partners: reduce the effort to move, cleanse, synchronize, manage, and automate data-related processes. https://aws.amazon.com/big-data/partner-solutions/
      • AWS Glue (Preview): a fully managed ETL service that makes it easy to understand your data sources, prepare the data, and move it reliably between data stores
  • 42. Data Consumption & Visualization (Consume/visualize)
      • Business users: Amazon QuickSight, analysis and visualization
      • Data scientists, developers: notebooks, IDE
      • Applications & API: apps & services, API
  • 43. Amazon QuickSight (Consume/visualize)
      • Clients: business users via the QuickSight UI (mobile devices, web browsers), QuickSight API, partner BI products
      • Components: data prep, metadata, suggestions, connectors, SPICE
      • Data sources: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon EMR, Amazon Redshift, Amazon RDS, Amazon Athena, files, apps, on-premises data (Direct Connect, JDBC/ODBC)
  • 44. Putting it all Together
  • 45. Real-Time Analytics and ML - DataXu
      • Who: spun out of MIT Labs; a petabyte-scale digital marketing platform; one of the fastest growing companies in Inc. 5000
      • What: help the world’s most valuable brands understand and engage with their consumers; maximize ROI
      • Quick Statistics: 2M+ bid requests per second; billions of impressions, petabytes of data; 180+ TB logs per day; 2 PB data analyzed daily; 3000+ servers powering the platform; 13 regions, 24x7
      • https://aws.amazon.com/solutions/case-studies/dataxu/
  • 46. Real-Time Ad Bidding Cycle ~10ms
  • 47. DataXu Data Flow (Batch Layer): CDN, Real Time Bidding, Retargeting Platform, Amazon Kinesis Streams, S3, All Data (S3), ETL, Attribution, Amazon Machine Learning, Amazon Athena, Advanced Analytics (Third Party), Reporting Tools (Third Party); ecosystem of tools and services
  • 48. [Architecture diagram] Data → Ingest/Collect → Store → Process/analyze → Consume/visualize → Answers & insights. Services: Amazon S3 Data Lake, Amazon EMR, Amazon Kinesis, Amazon Redshift, Amazon DynamoDB. Data: Users, Properties, Agents, Real Time Data, … Outputs: Hot Homes, User Profile, Recommendation, Similar Homes, Agent Follow-up, Agent Scorecard, Marketing, A/B Testing, BI / Reporting. https://aws.amazon.com/solutions/case-studies/redfin/
  • 49. Data Marts (Amazon Redshift) Query Cluster (EMR) Query Cluster (EMR) Auto Scaling EC2 Analytics App Normalization ETL Clusters (EMR) Batch Analytic Clusters (EMR) Ad Hoc Query Cluster (EMR) Auto Scaling EC2 Analytics App Users Data Providers Auto Scaling EC2 Data Ingestion Services Optimization ETL Clusters (EMR) Shared Metastore (RDS) Query Optimized (S3) Auto Scaling EC2 Data Catalog & Lineage Services Reference Data (RDS) Shared Data Services Auto Scaling EC2 Cluster Mgt & Workflow Services Source of Truth (S3) 5+ PB Data Lake Up to 75 billion market events analyzed per day https://aws.amazon.com/solutions/case-studies/finra/
  • 51. Recap
  • 52. Architectural Principles
      • Build decoupled systems: Data → Store → Process → Store → Analyze → Answers
      • Use the right tool for the job: data structure, latency, throughput, access patterns
      • Leverage AWS managed services: scalable/elastic, available, reliable, secure, no/low admin
      • Use log-centric design patterns: immutable logs, materialized views (schema-on-read)
      • Be cost-conscious: big data ≠ big cost
  • 53. [Reference architecture diagram: COLLECT → STORE → PROCESS / ANALYZE → CONSUME, with ETL]
      • Collect: applications (mobile apps, web apps, data centers, AWS Direct Connect); logging (Amazon CloudWatch, AWS CloudTrail); transport (AWS Import/Export Snowball); messaging; IoT (devices, sensors & IoT platforms, AWS IoT)
      • Data categories: RECORDS, DOCUMENTS, FILES, MESSAGES, STREAMS
      • Store (search, SQL, NoSQL, cache, file, message, stream; hot to warm, fast to slow): Amazon Elasticsearch Service, Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon SQS, Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
      • Process / analyze (batch, message, interactive, stream, ML): Amazon EMR, Presto, Amazon Redshift, Amazon Athena, Amazon SQS apps, Streaming KCL apps, AWS Lambda, Amazon Kinesis Analytics, Amazon Machine Learning, Amazon EC2
      • Consume: Amazon QuickSight, apps & services, analysis & visualization, notebooks, IDE, API
  • 54. Thank you! Antoine Généreux, Solutions Architect, AWS Canada
  • 55. Data Characteristics: Hot, Warm, Cold
      Columns: Hot | Warm | Cold
      Volume: MB–GB | GB–TB | PB–EB
      Item size: B–KB | KB–MB | KB–TB
      Latency: ms | ms, sec | min, hrs
      Durability: Low–high | High | Very high
      Request rate: Very high | High | Low
      Cost/GB: $$-$ | $-¢¢ | ¢
  • 56. Which Stream/Message Storage Should I Use? (ordered hot to warm)
      Columns: Amazon DynamoDB Streams | Amazon Kinesis Streams | Amazon Kinesis Firehose | Apache Kafka | Amazon SQS (Standard) | Amazon SQS (FIFO)
      AWS managed: Yes | Yes | Yes | No | Yes | Yes
      Guaranteed ordering: Yes | Yes | No | Yes | No | Yes
      Delivery (deduping): Exactly-once | At-least-once | At-least-once | At-least-once | At-least-once | Exactly-once
      Data retention period: 24 hours | 7 days | N/A | Configurable | 14 days | 14 days
      Availability: 3 AZ | 3 AZ | 3 AZ | Configurable | 3 AZ | 3 AZ
      Scale / throughput: No limit / ~ table IOPS | No limit / ~ shards | No limit / automatic | No limit / ~ nodes | No limits / automatic | 300 TPS / queue
      Parallel consumption: Yes | Yes | No | Yes | No | No
      Stream MapReduce: Yes | Yes | N/A | Yes | N/A | N/A
      Row/object size: 400 KB | 1 MB | Destination row/object size | Configurable | 256 KB | 256 KB
      Cost: Higher (table cost) | Low | Low | Low (+admin) | Low-medium | Low-medium
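Several options in the table on slide 56 deliver at least once, which means a consumer may see the same message more than once. A minimal sketch of one common way to cope with that, an idempotent consumer that remembers message IDs, is shown below. The `(message_id, body)` message shape and the `process_once` helper are assumptions for illustration; a real consumer would keep the seen IDs in durable storage rather than in memory.

```python
from typing import Callable, Iterable, List, Set, Tuple

Message = Tuple[str, str]  # (message_id, body); this shape is an assumption


def process_once(messages: Iterable[Message], handler: Callable[[str], None]) -> int:
    """Apply handler to each distinct message ID, skipping redelivered duplicates."""
    seen: Set[str] = set()  # in-memory only; durable storage would be needed in practice
    handled = 0
    for message_id, body in messages:
        if message_id in seen:
            continue  # duplicate delivery: already handled
        handler(body)
        seen.add(message_id)
        handled += 1
    return handled


# "m1" is delivered twice, as at-least-once delivery allows.
delivered = [("m1", "order A"), ("m2", "order B"), ("m1", "order A"), ("m3", "order C")]
results: List[str] = []
handled = process_once(delivered, results.append)
print(handled, results)
```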
  • 57. Which Data Store Should I Use? (ordered hot data to cold data)
      Columns: Amazon ElastiCache | Amazon DynamoDB | Amazon RDS/Aurora | Amazon ES | Amazon S3 | Amazon Glacier
      Average latency: ms | ms | ms, sec | ms, sec | ms, sec, min (~ size) | hrs
      Typical data stored: GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | MB–PB (no limit) | GB–PB (no limit)
      Typical item size: B-KB | KB (400 KB max) | KB (64 KB max) | B-KB (2 GB max) | KB-TB (5 TB max) | GB (40 TB max)
      Request rate: High – very high | Very high (no limit) | High | High | Low – high (no limit) | Very low
      Storage cost GB/month: $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢4/10
      Durability: Low - moderate | Very high | Very high | High | Very high | Very high
      Availability: High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ | High, 2 AZ | Very high, 3 AZ | Very high, 3 AZ
  • 58. Which Stream/Message Processing Technology Should I Use?
      Columns: Amazon EMR (Spark Streaming) | Apache Storm | KCL Application | Amazon Kinesis Analytics | AWS Lambda | Amazon SQS Application
      AWS managed: Yes (Amazon EMR) | No (do it yourself) | No (EC2 + Auto Scaling) | Yes | Yes | No (EC2 + Auto Scaling)
      Serverless: No | No | No | Yes | Yes | No
      Scale / throughput: No limits / ~ nodes | No limits / ~ nodes | No limits / ~ nodes | Up to 8 KPU / automatic | No limits / automatic | No limits / ~ nodes
      Availability: Single AZ | Configurable | Multi-AZ | Multi-AZ | Multi-AZ | Multi-AZ
      Programming languages: Java, Python, Scala | Almost any language via Thrift | Java, others via MultiLangDaemon | ANSI SQL with extensions | Node.js, Java, Python | AWS SDK languages (Java, .NET, Python, …)
      Uses: Multistage processing | Multistage processing | Single stage processing | Multistage processing | Simple event-based triggers | Simple event-based triggers
      Reliability: KCL and Spark checkpoints | Framework managed | Managed by KCL | Managed by Amazon Kinesis Analytics | Managed by AWS Lambda | Managed by SQS Visibility Timeout
  • 59. Which Analysis Tool Should I Use?
      Columns: Amazon Redshift | Amazon Athena | Amazon EMR (Presto, Spark, Hive)
      Use case: Optimized for data warehousing | Ad-hoc interactive queries | Presto: interactive query; Spark: general purpose (iterative ML, RT, ..); Hive: batch
      Scale/throughput: ~ Nodes | Automatic / no limits | ~ Nodes
      AWS managed service: Yes | Yes, serverless | Yes
      Storage: Local storage | Amazon S3 | Amazon S3, HDFS
      Optimization: Columnar storage, data compression, and zone maps | CSV, TSV, JSON, Parquet, ORC, Apache Web log | Framework dependent
      Metadata: Amazon Redshift managed | Athena Catalog Manager | Hive Meta-store
      BI tools support: Yes (JDBC/ODBC) | Yes (JDBC) | Yes (JDBC/ODBC & Custom)
      Access controls: Users, groups, and access controls | AWS IAM | Integration with LDAP
      UDF support: Yes (scalar) | No | Yes