© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Best Practices for Building a Data Lake
on Amazon S3 & Amazon Glacier
with special guests: Viber and Airbnb
John Mallory, Business Development, Storage
PD Dutta, Sr. Product Manager, Amazon S3
STG312
November 30, 2017
What to expect
• Defining the AWS data lake on Amazon S3 and Amazon Glacier
• Data cataloging
• Security, performance, and analytics best practices
• Special guests Viber and Airbnb
Defining the AWS data lake
Data lake is an architecture with a virtually
limitless centralized storage platform capable
of categorization, processing, analysis, and
consumption of heterogeneous data sets
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
What can you do with a data lake?
Amazon
Glacier
Amazon
S3
Amazon Redshift
Data Warehouse
Amazon EMR
Clusterless SQL Query
Amazon Athena
Clusterless ETL
AWS Glue
BI & Visualization
Hadoop/Hive/Presto
Batch processing
What can you do with a data lake?
Amazon
Glacier
Amazon
S3
Streaming and real-time analytics
AWS Lambda
Amazon
Elasticsearch
Service
Apache Storm
on EMR
Apache Flink
on EMR
Amazon Kinesis
Analytics
Spark Streaming
on EMR
Amazon
ElastiCache
What can you do with a data lake?
Amazon
Glacier
Amazon
S3
AI and machine learning
Life-like speech
Amazon Polly
Amazon Lex
Conversational
engine
Amazon Rekognition
Image analysis
Deep learning
Frameworks
MXNet, TensorFlow,
Theano, Caffe, Torch
Unmatched durability,
availability, and scalability
Best security, compliance, and audit
capability
Object-level control
at any scale
Business insight into
your data
Twice as many partner
integrations
Most ways to bring
data in
Reasons to choose Amazon S3 for data lake
Optimize costs with data tiering
Hot
Cold
Amazon
S3 standard
Amazon S3—
infrequent access
Amazon
Glacier
HDFS
✓ Use EMR/Hadoop with local HDFS for hottest data sets
✓ Store cooler data in S3 and Glacier to reduce costs
✓ Use S3 Analytics to optimize tiering strategy
S3 Analytics
Multiple data lake ingestion methods
AWS Snowball and AWS Snowmobile
• PB-scale migration
AWS Storage Gateway
• Migrate legacy files
Native/ISV Connectors
• Ecosystem integration
Amazon S3 Transfer Acceleration
• Long-distance data transfer
AWS Direct Connect
• On-premises integration
Amazon Kinesis Firehose
• Ingest device streams
• Transform and store on
Amazon S3
Catalog your S3 data
AWS Lambda
AWS Lambda
Metadata index
(DynamoDB)
Search index
(Amazon Elasticsearch
Service or Amazon
CloudSearch)
ObjectCreated
ObjectDeleted PutItem
Update Stream
Update Index
Extract Search Fields
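The event-driven catalog flow on this slide could be sketched roughly as follows. This is a minimal illustration, not the presenters' implementation: the item attributes, the `(bucket, key)` table key, and the optional `table` argument are assumptions made for the example. Note that S3 reports deletions as `ObjectRemoved:*` events.

```python
from urllib.parse import unquote_plus


def catalog_operations(event):
    """Translate an S3 event notification into metadata-index operations."""
    operations = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        event_name = record["eventName"]
        if event_name.startswith("ObjectCreated"):
            # Hypothetical item layout for the metadata index.
            item = {
                "bucket": bucket,
                "key": key,
                "size": record["s3"]["object"].get("size", 0),
                "event_time": record.get("eventTime"),
            }
            operations.append(("put", item))
        elif event_name.startswith("ObjectRemoved"):
            operations.append(("delete", {"bucket": bucket, "key": key}))
    return operations


def lambda_handler(event, context, table=None):
    """Apply the operations to a DynamoDB Table resource, if one is supplied."""
    operations = catalog_operations(event)
    if table is not None:
        for action, payload in operations:
            if action == "put":
                table.put_item(Item=payload)
            else:
                table.delete_item(Key=payload)
    return {"processed": len(operations)}
```

Keeping the event parsing in a pure function makes it testable without AWS access; in a deployed function the `table` would typically be a boto3 DynamoDB Table resource created outside the handler.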
AWS Glue analytics data catalog
Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools
like Hive, Presto, Spark, etc.
We added a few extensions:
• Search over metadata for data discovery
• Connection info: JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
Populating the AWS Glue data catalog
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run via Lambda triggers or scheduled; serverless—only pay when crawler runs
Crawlers automatically build your data catalog and keep it in sync
Securing your data on Amazon S3
Encryption
• Default encryption
• Server-side encryption
• Client-side encryption
• SSL endpoints
• Encryption status in inventory
• CRR with KMS
Identity and access
• Amazon Macie
• Permission checks
• AWS Config Rules
• IAM & bucket policies
• Access control lists
Compliance
• Certifications—HIPAA,
FedRAMP, PCI-DSS
• Cloud HSM integration
• Versioning & MFA deletes
• Audit logging
AWS data lake security entitlements
Security: Access to encryption keys
IAM Security Token
Service
Temporary
Credentials
Customer
Master Key
Customer Data
Keys
Ciphertext Key Plaintext Key
Amazon S3
S3 object
……
Name: MyData
Key: Ciphertext Key
…..
My Data
My Data
IAM best practices
SSL/TLS connections
Server-side encryption
Bucket policies
Versioning; recycle bin
MFA deletes
Security for your data lake
Pro Tip
Optimizing performance on Amazon S3
Getting high throughput with Amazon S3
examplebucket/232a-2017-26-05-15-00-00/cust1234234/photo1.jpg
examplebucket/7b54-2017-26-05-15-00-00/cust3857422/photo2.jpg
examplebucket/921c-2017-26-05-15-00-00/cust1248473/photo2.jpg
A bit more LIST friendly:
examplebucket/animations/232a-2017-26-05-15-00-00/cust1234234/animation1.obj
examplebucket/videos/ba65-2017-26-05-15-00-00/cust8474937/video2.mpg
examplebucket/photos/8761-2017-26-05-15-00-00/cust1248473/photo3.jpg
Random hash should come before patterns such as dates and sequential IDs
Always first ensure that your application can accommodate
Most customers need not worry about introducing entropy in key names
Consider 3-4 character hash for higher requests per second
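The key-naming pattern shown on this slide could be produced along these lines. This is a sketch under assumptions: the slide does not say how the hash is derived, so taking the first hex characters of a SHA-256 digest of the natural key is an illustrative choice, and `hashed_key` and its parameters are hypothetical names.

```python
import hashlib


def hashed_key(natural_key, category=None, hash_chars=4):
    """Prefix a key with a short hex hash derived from the key itself.

    The hash goes before the date and customer ID so that sequential
    patterns do not share a common leading prefix.
    """
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    key = f"{digest[:hash_chars]}-{natural_key}"
    # An optional leading category keeps the layout friendlier to LIST calls.
    return f"{category}/{key}" if category else key
```

Because the prefix is derived from the key rather than generated randomly, a reader can recompute it from the natural key and does not need a separate lookup table.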
Aggregate small files
EMR: S3distcp
Amazon Kinesis Firehose
S3 Select
Big data cheaper, faster
Up to 400% faster
Data Formats
Columnar formats
EMRFS consistent view
Optimizing data lake performance
Amazon
S3
Amazon
DynamoDB
Pro Tip
Big data analytics & query in place
Amazon S3
Data Catalog
Athena
EMR
Amazon Redshift Spectrum
Amazon ML/MXNet
RDS
Amazon QuickSight
Kinesis
Database
Migration
Service
AWS Glue
IAM
Other
Sources
Amazon analytics end-to-end architecture
Introducing Amazon S3 Select
Simple API to retrieve subset of data
based on a SQL expression
Accelerate performance for
data retrieval and processing
by up to 400%
Simplify compute by retrieving
subset of data in a common
format
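A call to S3 Select might look roughly like the sketch below. It assumes a boto3-style S3 client is passed in and that the object is a CSV file with a header row; the helper name `select_rows` is hypothetical, and the serialization settings are one possible choice rather than the only one.

```python
def select_rows(s3_client, bucket, key, sql):
    """Run a SQL expression against one CSV object and return the matching text."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=sql,
        # Assumption: the object is CSV and its first line names the columns.
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    chunks = []
    # The response payload is an event stream; only Records events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")
```

With boto3 installed, the client would typically come from `boto3.client("s3")`, and the SQL would reference the object as `S3Object`, for example `SELECT s.cust_id FROM S3Object s` (the column name here is illustrative).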
Highly distributed
processing frameworks
such as Hadoop/Spark
Compress datasets
Columnar file formats
Amazon EMR: Decouple compute & storage
Aggregate small files
S3distcp “group-by” clause
Pro Tip
Structured data w/ joins
Multiple on-demand
clusters-scale concurrency
Columnar file formats
Data partitioning
Better query performance
with predicate pushdown
Amazon Redshift Spectrum: Exabyte Scale
query-in-place
Pro Tip
Serverless service
Schema on read
Compress datasets
Columnar file formats
Amazon Athena: Query without ETL
Optimize file sizes
Optimize querying (Presto
backend)
Pro Tip
Use the right data formats
• Pay by the amount of data scanned per query
• Use compressed columnar formats
• Parquet
• ORC
• Easy to integrate with wide variety of tools
Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75
Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
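The cost column in the table follows from the amount of data scanned. As a rough check, assuming the $5 per TB scanned rate implied by the first row (1.15 TB scanned for $5.75) and 1 TB = 1,024 GB, the arithmetic can be sketched as:

```python
USD_PER_TB_SCANNED = 5.0  # assumption: rate implied by 1.15 TB scanned -> $5.75
GB_PER_TB = 1024


def scan_cost_usd(tb_scanned, usd_per_tb=USD_PER_TB_SCANNED):
    """Cost of a query that scans the given number of terabytes."""
    return tb_scanned * usd_per_tb


def percent_less(before, after):
    """How much smaller `after` is than `before`, as a percentage."""
    return (1 - after / before) * 100


text_cost = scan_cost_usd(1.15)                    # text files: 1.15 TB scanned
parquet_cost = scan_cost_usd(2.69 / GB_PER_TB)     # Parquet: 2.69 GB scanned
size_saving = percent_less(1 * GB_PER_TB, 130)     # 1 TB of text vs 130 GB of Parquet
```

The Parquet query is cheaper both because the files are smaller and because a columnar format lets the engine read only the columns a query touches.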
Viber data lake
Amir Ish-Shalom
Chief Architect, Viber
Special guest: Viber
• Messaging (including group)
• Secure end-to-end encryption
• Rich media and chat extensions
• Full multiple device support
• HD video and voice calls
• Viber out and Viber in
• Public chats and accounts
• Chatbots
Big data @ Viber
• Close to 1 billion users worldwide
• Globally used in 230 countries
• 10-15 billion events daily (2 TB)
• 300,000 events per second (peak hours)
• 5 PB of data stored on Amazon S3/Amazon Glacier
• NoSQL DB (Couchbase) performing
2 million TPS on 20 TB of data
with 35 billion keys
Viber—big data architecture
RT data pipeline
Kinesis
Viber backend
servers
RT data processor
Apache Storm
Data lake
Amazon S3
Raw data backup
Kinesis Firehose
ETL jobs
Spark, Presto, Pig,
Lambda functions
Query engines
Presto, Athena,
Spark SQL, Pig
Reporting tools
Tableau, Redash,
Zeppelin, others
Databases &
data warehouses
Amazon Redshift, Aurora,
MySQL
Events
NoSQL profile database
Couchbase
Data lake challenges
Use case #1: S3 performance
Use case #2: Data access rights
Use case #3: Encrypted data storage
Use case #4: Storage of data from third parties
Use case: S3 performance
Challenge:
• Over 300 different event types with large throughput variance
• Storm data processor created many small files, especially for lower
throughput events
• Events were stored in Hive partitioned folders (Y/M/D/H), which
are not optimal for Amazon S3
• Running a query over these events using Presto could generate up
to 15K tps on a single S3 bucket, resulting in 5xx errors and
throttling the whole bucket for other processes
Use case: S3 performance
Solution:
• Concatenate small files into large files, optimally 100 MB+
• Convert files into columnar file format such as Parquet or ORC
S3 → S3DistCp (concatenate files) → S3 → Spark (convert to Parquet) → S3
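The concatenation step depends on deciding which small files to merge together. The sketch below shows one simple way to plan batches of roughly 100 MB or more. It is an illustration only, not how S3DistCp groups files, and `plan_batches` is a hypothetical helper.

```python
TARGET_BYTES = 100 * 1024 * 1024  # the 100 MB+ target named on the slide


def plan_batches(files, target_bytes=TARGET_BYTES):
    """Group (name, size) pairs into batches that reach target_bytes where possible."""
    batches, current, current_size = [], [], 0
    for name, size in files:
        current.append(name)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:
        # Leftover files fall short of the target: fold them into the last
        # batch if there is one, so no undersized output file is produced.
        if batches:
            batches[-1].extend(current)
        else:
            batches.append(current)
    return batches
```

Each batch would then be written as one output object, so a low-throughput event type produces a few large files instead of many small ones.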
Use case: S3 performance
Future solution:
• Concatenate and convert files in a single process (Glue?)
• Use a better-partitioned Hive directory format (H/D/M/Y instead of
Y/M/D/H)
• Use even larger files for high-throughput events
S3 → Spark/AWS Glue (concatenate files & convert to Parquet) → S3
Use case: Data access rights
Challenge:
• Events can contain sensitive personal data
• Allow access to events without exposing personal data
Use case: Data access rights
Solution #1:
• Create separate Hive metastores for full and redacted data
• Reporting tools will select relevant metastore based on current
user
Event definitions
Aurora RDS
Full access
Hive Metastore
Redacted access
Hive Metastore
Use case: Data access rights
Solution #2:
• Anonymize sensitive personal data
• Legal compliance issues
• Limits data science capabilities
RT data pipeline
Kinesis
RT data processor
Apache Storm
Data lake
S3
Raw Data Anonymized Data
Use case: Encrypted data storage
Challenge:
• Store daily backups in S3
• Backup must be encrypted
• Strict access control
• Regional replication
• Complex data retention
Use Case: Encrypted data storage
Solution
Viber back-end servers → (SSE-KMS) → local region data storage (Amazon S3) → cross-region replication (CRR-KMS) → remote regional data storage (Amazon S3)
Security: encrypt using SSE-KMS
Access: require permissions to both S3 bucket & KMS key
Tagging: use S3 object-level tagging to apply different lifecycle policies to certain objects
Multiple regions: use CRR-KMS
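An upload that combines SSE-KMS encryption with object-level tags could be parameterized roughly as follows. The helper `encrypted_put_params`, the key alias, and the tag names are hypothetical; the dictionary keys match the parameters of the boto3 `put_object` call.

```python
from urllib.parse import urlencode


def encrypted_put_params(bucket, key, body, kms_key_id, tags):
    """Build put_object parameters for an SSE-KMS encrypted, tagged object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        # Ask S3 to encrypt the object with a KMS-managed key.
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
        # Object tags are passed as a URL query string, e.g. "retention=90d".
        "Tagging": urlencode(tags),
    }
```

With boto3 installed this would be used as `boto3.client("s3").put_object(**params)`. The caller needs permissions on both the bucket and the KMS key, and lifecycle rules can then filter on the tags to apply different retention periods.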
Use case: Storage of data from third parties
Challenge:
• Securely store data from third parties in data lake
• Validate data before storing
• Allow optional data transformation
Use case: Storage of data from third parties
Solution:
Third party → S3 (external bucket) → Lambda → (SSE-KMS) → S3 data lake
• Third-party access via access keys
• Lambda validates data, performs transformations, and uploads it using KMS to another S3 bucket in the Viber data lake
Viber data lake—summary
Airbnb—tiered storage system
Hongbo Zeng
Software Engineer, Airbnb
Special guest: Airbnb
Agenda
• Challenges
• Tiered storage system
• S3A+ file system
Motivations
• 3x+ YoY data growth
• HDFS bottlenecks
• Namenode scalability
• Cost
• S3 is an object store
• Eventual consistency
• Metadata retrieval performance
• Read/write performance
Tiered storage system
• HDFS + S3
• Hot data on HDFS
• Warm and cold data on S3
• Bring the best of both together
• Performance
• Scalability
• Cost
HDFS
Cluster
S3
Clients
(Hive, Presto, Spark, etc.)
Architecture
HDFS/Hive
Cluster
S3
Archive
Policy
Retention
Policy
UI
FSImage /
Metastore
Pipeline
Storage
Processor
Data
Archive
Data
Retention
Data
Backup
Backup
DB: foo, Table: bar, Part: baz
Stages, in order (each stage runs on reducers; files a0 and a1 go to one reducer, a2 and a3 to another):
• Generate paths for files (dest paths for foo/bar/baz/a0 through a3)
• Copy files
• Generate and compare CRC
• Update the metadata for partitions (tags for foo/bar/baz)
Src | Dest
foo/bar/baz/a0 | s3://bkt/foo/bar/baz/data/a0
foo/bar/baz/a1 | s3://bkt/foo/bar/baz/data/a1
foo/bar/baz/a2 | s3://bkt/foo/bar/baz/data/a2
foo/bar/baz/a3 | s3://bkt/foo/bar/baz/data/a3
Archive
• Metadata validation
• Is there a successful backup?
• Is the backup location valid?
• Anything changed since the latest backup?
• Data validation
• File count
• File size
• Archive
• Update the location of partitions
The journey of a partition
2017-11-30: generate a partition foo.db/bar/ds=11-30
2017-11-30: back up the partition to s3://bkt/foo.db/bar/ds=11-30
2017-12-30: archive the partition foo.db/bar/ds=11-30
2017-12-31: delete the data from HDFS
Problem solved?
• HDFS bottlenecks
• Namenode scalability
• Cost
• S3 is an object store
• Eventual consistency
• Metadata retrieval performance
• Read/write performance
Problem solved … partly
• HDFS bottlenecks
• Namenode scalability
• Cost
• S3 is an object store
• Eventual consistency
• Metadata retrieval performance
• Read/write performance
S3A+ file system
• Cache metadata
• Leverage S3 multipart API
• Prefetch data for reads
Metadata cache in MySQL
Path | Is dir | Is empty | Length
s3a://bucket/foo/bar/baz/data | 1 | 0 | 0
s3a://bucket/foo/bar/baz/data/a0 | 0 | 0 | 100
s3a://bucket/foo/bar/baz/data/a1 | 0 | 0 | 200
s3a://bucket/foo/bar/baz/data/a2 | 0 | 0 | 300
30x Speed Up
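The idea of the metadata cache can be sketched with a small relational table holding the same columns as the slide. The example below uses SQLite from the Python standard library as a stand-in for MySQL, and the function names are hypothetical; it is not the S3A+ implementation.

```python
import sqlite3


def open_cache():
    """Create an in-memory cache table with the columns shown on the slide."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata ("
        "path TEXT PRIMARY KEY, is_dir INTEGER, is_empty INTEGER, length INTEGER)"
    )
    return conn


def put_status(conn, path, is_dir, is_empty, length):
    """Insert or refresh the cached status of one path."""
    conn.execute(
        "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?, ?)",
        (path, int(is_dir), int(is_empty), length),
    )


def get_status(conn, path):
    """Return the cached status, or None on a cache miss."""
    row = conn.execute(
        "SELECT is_dir, is_empty, length FROM metadata WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        return None  # the caller would fall back to asking S3
    return {"is_dir": bool(row[0]), "is_empty": bool(row[1]), "length": row[2]}


def list_under(conn, directory):
    """List every cached path below a directory without calling S3 LIST."""
    prefix = directory.rstrip("/") + "/"
    rows = conn.execute(
        "SELECT path FROM metadata WHERE substr(path, 1, ?) = ? ORDER BY path",
        (len(prefix), prefix),
    )
    return [row[0] for row in rows]
```

Answering file-status and listing requests from an indexed table avoids a round trip to S3 for each path, which is where a speedup of the kind reported on the slide would come from.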
S3 multipart API
• Improved throughput
• Quick recovery from any network issues
• Pause and resume object uploads
• Begin an upload before you know the final object size
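A multipart upload starts by splitting the object into parts. The sketch below plans part boundaries within two S3 limits: every part except the last must be at least 5 MiB, and an upload can have at most 10,000 parts. The 8 MiB default and the helper name `plan_parts` are assumptions for illustration.

```python
MIN_PART_BYTES = 5 * 1024 * 1024  # S3 minimum for every part except the last
MAX_PARTS = 10_000                # S3 maximum number of parts per upload


def plan_parts(object_size, part_size=8 * 1024 * 1024):
    """Return (part_number, offset, length) tuples covering the whole object."""
    if part_size < MIN_PART_BYTES:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")
    # Grow the part size if the object would otherwise need too many parts.
    while object_size > part_size * MAX_PARTS:
        part_size *= 2
    parts = []
    offset = 0
    number = 1
    while offset < object_size:
        length = min(part_size, object_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Because each part is an independent request, parts can be uploaded in parallel and a failed part can be retried on its own, which is what gives the throughput and recovery benefits listed above.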
S3A multipart API
Read prefetch
Performance
[Figure: Hive Query Performance. Latency (seconds) for Hive queries 1-6, comparing S3A latency and S3A+ latency.]
Putting it all together
To summarize
✓ Always store a copy of the raw input
✓ Use automation with S3 events to enable trigger-based workflows
✓ Implement the right security controls
✓ Use a format that supports your data, rather than forcing your data into
the format
✓ Partition data to improve performance
✓ Apply compression to lower network load and cost
For Enterprise Storage Engineers
• Learn how to architect and
manage highly available
solutions on AWS storage
services
• Advance toward AWS
certifications
• Help your organization migrate
to the cloud faster
Online at www.aws.training
• Access 100+ new digital
training courses including
advanced training on storage
• Deep dives on Amazon S3, EFS,
and EBS
• Migrating and tiering storage to
AWS (hybrid solutions)
At re:Invent
• Visit Hands-on Labs at the
Venetian
• Attend a proctored
“Introduction to EFS” Spotlight
Lab on Thursday at 3pm at the
Venetian
• Meet storage experts at the Ask
the Experts in Hands-on Labs
room at the Venetian
New storage training
Q&A
Amazon S3 Amazon Glacier
Thank You!

Introduzione a Amazon Elastic Container Service
 

Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with Special Guests, Airbnb & Viber - STG312 - re:Invent 2017

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
      AWS re:INVENT
      Best Practices for Building a Data Lake on Amazon S3 & Amazon Glacier, with special guests: Viber and Airbnb
      John Mallory, Business Development, Storage
      PD Dutta, Sr. Product Manager, Amazon S3
      STG312, November 30, 2017
  • 2. What to expect
      • Defining the AWS data lake on Amazon S3 and Amazon Glacier
      • Data cataloging
      • Security, performance, and analytics best practices
      • Special guests Viber and Airbnb
  • 3. Defining the AWS data lake
      A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets.
      Key data lake attributes:
      • Decoupled storage and compute
      • Rapid ingest and transformation
      • Secure multi-tenancy
      • Query in place
      • Schema on read
  • 4. What can you do with a data lake? (storage: Amazon S3, Amazon Glacier)
      • Amazon Redshift: data warehouse
      • Amazon EMR: Hadoop/Hive/Presto
      • Amazon Athena: clusterless SQL query
      • AWS Glue: clusterless ETL
      • BI & visualization
      • Batch processing
  • 5. What can you do with a data lake? (storage: Amazon S3, Amazon Glacier)
      Streaming and real-time analytics:
      • AWS Lambda
      • Amazon Elasticsearch Service
      • Apache Storm on EMR
      • Apache Flink on EMR
      • Amazon Kinesis Analytics
      • Spark Streaming on EMR
      • Amazon ElastiCache
  • 6. What can you do with a data lake? (storage: Amazon S3, Amazon Glacier)
      AI and machine learning:
      • Amazon Polly: life-like speech
      • Amazon Lex: conversational engine
      • Amazon Rekognition: image analysis
      • Deep learning frameworks: MXNet, TensorFlow, Theano, Caffe, Torch
  • 7. Reasons to choose Amazon S3 for data lake
      • Unmatched durability, availability, and scalability
      • Best security, compliance, and audit capability
      • Object-level control at any scale
      • Business insight into your data
      • Twice as many partner integrations
      • Most ways to bring data in
  • 8. Optimize costs with data tiering
      Hot to cold: HDFS, Amazon S3 Standard, Amazon S3 Infrequent Access, Amazon Glacier
      ✓ Use EMR/Hadoop with local HDFS for hottest data sets
      ✓ Store cooler data in S3 and Glacier to reduce costs
      ✓ Use S3 Analytics to optimize tiering strategy
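The tiering on slide 8 is usually expressed as an S3 lifecycle configuration. The sketch below only builds the configuration document. The prefix, the rule ID, and the 30/90-day thresholds are illustrative assumptions, not values from the talk.

```python
def tiering_rule(prefix: str, ia_after_days: int = 30, glacier_after_days: int = 90) -> dict:
    """Build a lifecycle configuration that moves objects under `prefix`
    to S3 Standard-IA and later to Glacier. Thresholds are assumptions."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": f"tier-{prefix.strip('/') or 'all'}",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


config = tiering_rule("logs/")
```

With boto3, a document of this shape is what `put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)` expects. S3 Analytics output would inform the day thresholds.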
  • 9. Multiple data lake ingestion methods
      • AWS Snowball and AWS Snowmobile: PB-scale migration
      • AWS Storage Gateway: migrate legacy files
      • Native/ISV connectors: ecosystem integration
      • Amazon S3 Transfer Acceleration: long-distance data transfer
      • AWS Direct Connect: on-premises integration
      • Amazon Kinesis Firehose: ingest device streams; transform and store on Amazon S3
  • 10. Catalog your S3 data
      • S3 events (ObjectCreated, ObjectDeleted) trigger AWS Lambda
      • Lambda writes to the metadata index (DynamoDB) with PutItem
      • The DynamoDB update stream triggers a second AWS Lambda function
      • That function extracts search fields and updates the search index (Amazon Elasticsearch Service or Amazon CloudSearch)
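The first hop of the slide 10 pipeline can be sketched as the logic a Lambda function might run on each S3 event. This is a minimal sketch, assuming the standard S3 event notification record layout. The function name and item attributes are hypothetical, and the DynamoDB write itself is left out so the example runs without AWS access.

```python
from urllib.parse import unquote_plus


def index_actions(event: dict) -> list:
    """Turn an S3 event notification into ('put' | 'delete', item) actions
    for a metadata index. Attribute names are illustrative assumptions."""
    actions = []
    for record in event.get("Records", []):
        name = record["eventName"]
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if name.startswith("ObjectCreated"):
            item = {
                "bucket": bucket,
                "key": key,
                "size": record["s3"]["object"].get("size", 0),
                "event_time": record.get("eventTime"),
            }
            actions.append(("put", item))
        elif name.startswith("ObjectRemoved"):
            # The slide labels this "ObjectDeleted"; S3 records name it ObjectRemoved.
            actions.append(("delete", {"bucket": bucket, "key": key}))
    return actions


sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2017-11-30T15:00:00.000Z",
            "s3": {
                "bucket": {"name": "examplebucket"},
                "object": {"key": "photos/my+photo.jpg", "size": 1024},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {"bucket": {"name": "examplebucket"}, "object": {"key": "old.jpg"}},
        },
    ]
}
actions = index_actions(sample_event)
```

A real handler would pass each "put" item to DynamoDB and let the table's stream drive the search-index update.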
  • 11. AWS Glue analytics data catalog
  • 12. AWS Glue analytics data catalog
      Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools like Hive, Presto, Spark, etc.
      We added a few extensions:
      • Search over metadata for data discovery
      • Connection info: JDBC URLs, credentials
      • Classification for identifying and parsing files
      • Versioning of table metadata as schemas evolve and other metadata are updated
      Populate using Hive DDL, bulk import, or automatically through crawlers.
  • 13. Populating the AWS Glue data catalog
      Crawlers automatically build your data catalog and keep it in sync.
      • Automatically discover new data and extract schema definitions
          • Detect schema changes and version tables
          • Detect Hive-style partitions on Amazon S3
      • Built-in classifiers for popular types; custom classifiers using Grok expressions
      • Run via Lambda triggers or on a schedule; serverless, so you pay only when a crawler runs
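To make "Hive-style partitions" concrete: partition columns are encoded in the object key as `name=value` path segments. The sketch below parses them from a key. It illustrates the naming convention only and is not how Glue crawlers are implemented. The example key is made up.

```python
def parse_hive_partitions(key: str) -> dict:
    """Extract name=value partition segments from an S3 object key."""
    partitions = {}
    # The last path segment is the file name, so it is skipped.
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


example = parse_hive_partitions("events/year=2017/month=11/day=30/part-0000.parquet")
```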
  • 14. Securing your data on Amazon S3
  • 15. AWS data lake security entitlements
      Encryption:
      • Default encryption
      • Server-side encryption
      • Client-side encryption
      • SSL endpoints
      • Encryption status in inventory
      • CRR with KMS
      Identity and access:
      • Amazon Macie
      • Permission checks
      • AWS Config Rules
      • IAM & bucket policies
      • Access control lists
      Compliance:
      • Certifications: HIPAA, FedRAMP, PCI-DSS
      • Cloud HSM integration
      • Versioning & MFA deletes
      • Audit logging
  • 16. Security: Access to encryption keys
      [Diagram labels: IAM, Security Token Service, temporary credentials, customer master key, customer data keys, ciphertext key, plaintext key]
  • 17. Security: Access to encryption keys
      [Diagram labels: IAM, Security Token Service, temporary credentials, customer master key, customer data keys, ciphertext key, plaintext key; Amazon S3 object with Name: MyData and Key: Ciphertext Key]
  • 18. Security for your data lake (Pro Tip)
      • IAM best practices
      • SSL/TLS connections
      • Server-side encryption
      • Bucket policies
      • Versioning; recycle bin
      • MFA deletes
  • 19. Optimizing performance on Amazon S3
  • 20. Getting high throughput with Amazon S3
      • Most customers need not worry about introducing entropy in key names
      • Consider a 3-4 character hash for higher requests per second
      • Random hash should come before patterns such as dates and sequential IDs
      • Always first ensure that your application can accommodate
      Example keys:
          examplebucket/232a-2017-26-05-15-00-00/cust1234234/photo1.jpg
          examplebucket/7b54-2017-26-05-15-00-00/cust3857422/photo2.jpg
          examplebucket/921c-2017-26-05-15-00-00/cust1248473/photo2.jpg
      A bit more LIST friendly:
          examplebucket/animations/232a-2017-26-05-15-00-00/cust1234234/animation1.obj
          examplebucket/videos/ba65-2017-26-05-15-00-00/cust8474937/video2.mpg
          examplebucket/photos/8761-2017-26-05-15-00-00/cust1248473/photo3.jpg
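One way to produce key names shaped like those on slide 20 is to derive the short prefix from a hash of the key itself, so the prefix can be recomputed on read. This is a sketch under that assumption. The talk does not say how the prefixes were generated, and the helper name and the choice of SHA-256 are illustrative.

```python
import hashlib
from typing import Optional


def add_hash_prefix(key: str, length: int = 4, category: Optional[str] = None) -> str:
    """Prepend a short hex hash to `key`. Passing `category` keeps a
    LIST-friendly leading folder, as in the second group of examples."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    prefixed = f"{digest[:length]}-{key}"
    return f"{category}/{prefixed}" if category else prefixed


plain = add_hash_prefix("2017-26-05-15-00-00/cust1234234/photo1.jpg")
listable = add_hash_prefix("2017-26-05-15-00-00/cust1248473/photo3.jpg", category="photos")
```

Because the prefix is deterministic, a reader that knows the logical key can rebuild the full object key without a lookup table.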
  • 21. Optimizing data lake performance (Pro Tip)
      • Aggregate small files: EMR S3DistCp, Amazon Kinesis Firehose
      • S3 Select: big data cheaper, faster; up to 400% faster
      • Data formats: columnar formats
      • EMRFS consistent view: Amazon S3, Amazon DynamoDB
  • 22. Big data analytics & query in place
  • 23. Amazon analytics end-to-end architecture
      [Diagram labels: Amazon S3, Data Catalog, Athena, EMR, Amazon Redshift Spectrum, Amazon ML/MXNet, RDS, Amazon QuickSight, Kinesis, Database Migration Service, AWS Glue, IAM, other sources]
  • 24. Introducing Amazon S3 Select
      • Simple API to retrieve a subset of data based on a SQL expression
      • Accelerate performance for data retrieval and processing by up to 400%
      • Simplify compute by retrieving a subset of data in a common format
  • 25. Amazon EMR: Decouple compute & storage (Pro Tip)
      • Highly distributed processing frameworks such as Hadoop/Spark
      • Compress datasets
      • Columnar file formats
      • Aggregate small files
      • S3DistCp "group-by" clause
  • 26. Amazon Redshift Spectrum: Exabyte-scale query in place (Pro Tip)
      • Structured data w/ joins
      • Multiple on-demand clusters scale concurrency
      • Columnar file formats
      • Data partitioning
      • Better query performance with predicate pushdown
  • 27. Amazon Athena: Query without ETL (Pro Tip)
      • Serverless service
      • Schema on read
      • Compress datasets
      • Columnar file formats
      • Optimize file sizes
      • Optimize querying (Presto backend)
  • 28. Use the right data formats
      • Pay by the amount of data scanned per query
      • Use compressed columnar formats
          • Parquet
          • ORC
      • Easy to integrate with a wide variety of tools

      Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
      Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75
      Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
      Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
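The cost column of the slide 28 table follows from paying per byte scanned. The sketch below reproduces that arithmetic. The per-TB rate is not stated on the slide, so it is derived here from the first row ($5.75 for 1.15 TB), and 1 TB is assumed to mean 1024 GB.

```python
# Rate implied by the text-file row of the table: $5.75 / 1.15 TB, about $5 per TB.
PRICE_PER_TB_USD = 5.75 / 1.15


def query_cost(scanned_gb: float, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost in USD of a query that scans `scanned_gb` gigabytes."""
    return scanned_gb / 1024 * price_per_tb


text_cost = query_cost(1.15 * 1024)   # logs stored as text files
parquet_cost = query_cost(2.69)       # same logs in Parquet
cost_reduction = 1 - parquet_cost / text_cost
size_reduction = 1 - 130 / 1024       # 130 GB of Parquet vs 1 TB of text
```

Under these assumptions the Parquet query rounds to $0.013, the storage reduction to 87%, and the cost reduction to just over 99.7%, matching the table's cost and size columns.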
  • 29. Special guest: Viber
      Viber data lake
      Amir Ish-Shalom, Chief Architect, Viber
  • 30. Viber features
      • Messaging (including group)
      • Secure end-to-end encryption
      • Rich media and chat extensions
      • Full multiple device support
      • HD video and voice calls
      • Viber out and Viber in
      • Public chats and accounts
      • Chatbots
  • 32. Big data @ Viber
      • Close to 1 billion users worldwide
      • Globally used in 230 countries
      • 10-15 billion events daily (2 TB)
      • 300,000 events per second (peak hours)
      • 5 PB of data stored on Amazon S3/Amazon Glacier
      • NoSQL DB (Couchbase) performing 2 million TPS on 20 TB of data with 35 billion keys
  • 33. Viber: big data architecture
      • Viber backend servers
      • Events
      • RT data pipeline: Kinesis
      • RT data processor: Apache Storm
      • Raw data backup: Kinesis Firehose
      • Data lake: Amazon S3
      • ETL jobs: Spark, Presto, Pig, Lambda functions
      • Query engines: Presto, Athena, Spark SQL, Pig
      • Reporting tools: Tableau, Redash, Zeppelin, others
      • Databases & data warehouses: Amazon Redshift, Aurora, MySQL
      • NoSQL profile database: Couchbase
  • 34. Data lake challenges
      • Use case #1: S3 performance
      • Use case #2: Data access rights
      • Use case #3: Encrypted data storage
      • Use case #4: Storage of data from third parties
  • 35. Use case: S3 performance
      Challenge:
      • Over 300 different event types with large throughput variance
      • Storm data processor created many small files, especially for lower-throughput events
      • Events were stored in Hive partitioned folders (Y/M/D/H), which are not optimal for Amazon S3
      • Running a query over these events using Presto could generate up to 15K tps on a single S3 bucket, resulting in 5xx errors and throttling the whole bucket for other processes
  • 36. Use case: S3 performance
      Solution:
      • Concatenate small files into large files, optimally 100 MB+
      • Convert files into a columnar file format such as Parquet or ORC
      Flow: S3 → S3DistCp (concatenate files) → S3 → Spark (convert to Parquet) → S3
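The "concatenate small files into 100 MB+ files" step needs a plan for which inputs go into which output. Viber used S3DistCp for this. The sketch below is only a plain-Python illustration of one possible grouping rule (greedy, in listing order), and the file names and sizes are made up.

```python
TARGET_BYTES = 100 * 1024 * 1024  # the 100 MB target from the slide


def plan_batches(files, target_bytes=TARGET_BYTES):
    """Group (key, size) pairs into batches, closing a batch as soon as its
    total size reaches `target_bytes`. The final batch may be smaller."""
    batches, current, current_size = [], [], 0
    for key, size in files:
        current.append(key)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)
    return batches


# Nine hypothetical 30 MB event files.
small_files = [(f"events/part-{i:04d}", 30 * 1024 * 1024) for i in range(9)]
batches = plan_batches(small_files)
```

Each batch would then be written out as one larger object before (or while) converting to Parquet.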
  • 37. Use case: S3 performance
      Future solution:
      • Concatenate and convert files in a single process (Glue?)
      • Use a better partitioned Hive directory format (H/D/M/Y instead of Y/M/D/H)
      • Use even larger files for high-throughput events
      Flow: S3 → Spark/AWS Glue (concatenate files & convert to Parquet) → S3
  • 38. Use case: Data access rights
      Challenge:
      • Events can contain sensitive personal data
      • Allow access to events without exposing personal data
  • 39. Use case: Data access rights
      Solution #1:
      • Create separate Hive metastores for full and redacted data
      • Reporting tools will select the relevant metastore based on the current user
      [Diagram labels: event definitions, Aurora RDS, full access Hive metastore, redacted access Hive metastore]
  • 40. Use case: Data access rights
      Solution #2:
      • Anonymize sensitive personal data
      • Legal compliance issues
      • Limits data science capabilities
      Flow: RT data pipeline (Kinesis) → RT data processor (Apache Storm) → data lake (S3): raw data, anonymized data
  • 41. Use case: Encrypted data storage
      Challenge:
      • Store daily backups in S3
      • Backup must be encrypted
      • Strict access control
      • Regional replication
      • Complex data retention
  • 42. Use case: Encrypted data storage
      Solution:
      • Security: encrypt using SSE-KMS
      • Access: require permissions to both the S3 bucket & the KMS key
      • Tagging: use S3 object-level tagging to apply different lifecycle policies to certain objects
      • Multiple regions: use CRR-KMS
      Flow: Viber back-end servers → SSE-KMS → local region data storage (Amazon S3) → cross-region replication (CRR-KMS) → remote regional data storage (Amazon S3)
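The encryption and tagging points on slide 42 map onto parameters of a single S3 upload request. The sketch below builds those parameters without calling AWS. The bucket, key, KMS alias, and tag names are placeholders, not Viber's actual values.

```python
from urllib.parse import urlencode


def encrypted_put_params(bucket: str, key: str, body: bytes, kms_key_id: str, tags: dict) -> dict:
    """Keyword arguments for an S3 PutObject request that uses SSE-KMS and
    attaches object tags (S3 expects tags as a URL-encoded string)."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_id,
        "Tagging": urlencode(tags),
    }


params = encrypted_put_params(
    bucket="example-backups",                 # placeholder bucket name
    key="daily/2017-11-30.tar.gz",            # placeholder object key
    body=b"backup bytes",
    kms_key_id="alias/example-backup-key",    # placeholder KMS key alias
    tags={"retention": "90d", "type": "backup"},
)
```

With boto3 these would be passed as `s3_client.put_object(**params)`. The caller needs permission on both the bucket and the KMS key, and a lifecycle rule can then filter on the `retention` tag.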
  • 43. Use case: Storage of data from third parties
      Challenge:
      • Securely store data from third parties in the data lake
      • Validate data before storing
      • Allow optional data transformation
  • 44. Use case: Storage of data from third parties
      Solution:
      • Third-party access via access keys
      • Lambda validates the data, performs transformations, and uploads it using KMS to another S3 bucket in the Viber data lake
      Flow: third party → S3 (external bucket) → Lambda → SSE-KMS → S3 data lake
  • 45. Viber data lake: summary
  • 46. Special guest: Airbnb
      Airbnb: tiered storage system
      Hongbo Zeng, Software Engineer, Airbnb
  • 47. Agenda
      • Challenges
      • Tiered storage system
      • S3A+ file system
  • 48. Motivations
      • 3x+ YoY data growth
      • HDFS bottlenecks
          • Namenode scalability
          • Cost
      • S3 is an object store
          • Eventual consistency
          • Metadata retrieval performance
          • Read/write performance
  • 49. Tiered storage system
      • HDFS + S3
          • Hot data on HDFS
          • Warm and cold data on S3
      • Bring the best of both together
          • Performance
          • Scalability
          • Cost
      [Diagram labels: clients (Hive, Presto, Spark, etc.), HDFS cluster, S3]
  • 50. Architecture
      [Diagram labels: HDFS/Hive cluster, S3, archive policy, retention policy, UI, FSImage/metastore pipeline, storage processor, data archive, data retention, data backup]
  • 51. Backup
      Input: DB foo, table bar, partition baz
      1. Generate paths for files (reducers):

          Src | Dest
          foo/bar/baz/a0 | s3://bkt/foo/bar/baz/data/a0
          foo/bar/baz/a1 | s3://bkt/foo/bar/baz/data/a1
          foo/bar/baz/a2 | s3://bkt/foo/bar/baz/data/a2
          foo/bar/baz/a3 | s3://bkt/foo/bar/baz/data/a3

      2. Copy files (reducers): one reducer copies a0 and a1, another copies a2 and a3
      3. Generate and compare CRC (reducers): same split, a0 and a1, a2 and a3
      4. Update the metadata for partitions (reducers): tags for foo/bar/baz
  • 52. Archive
      • Metadata validation
          • Is there a successful backup?
          • Is the backup location valid?
          • Anything changed since the latest backup?
      • Data validation
          • File count
          • File size
      • Archive
          • Update the location of partitions
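The data validation step on slide 52 (file count and file size) can be sketched as a comparison of two listings. This is an illustrative check, not Airbnb's implementation. It assumes each listing is a mapping from file name to size in bytes, and it compares sizes per file, which is stricter than comparing totals.

```python
def safe_to_archive(source_files: dict, backup_files: dict) -> bool:
    """Return True only if the backup has the same number of files as the
    source and every file has the same size."""
    if len(source_files) != len(backup_files):
        return False  # file count mismatch
    return all(backup_files.get(name) == size for name, size in source_files.items())


hdfs_listing = {"a0": 100, "a1": 200, "a2": 300}
s3_listing = {"a0": 100, "a1": 200, "a2": 300}
ok = safe_to_archive(hdfs_listing, s3_listing)
truncated = safe_to_archive(hdfs_listing, {"a0": 100, "a1": 200, "a2": 299})
```

Only when a check like this passes would the partition location be switched to S3.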
  • 53. The journey of a partition
      • 2017-11-30: generate a partition foo.db/bar/ds=11-30
      • 2017-11-30: back up the partition to s3://bkt/foo.db/bar/ds=11-30
      • 2017-12-30: archive the partition foo.db/bar/ds=11-30
      • 2017-12-31: delete the data from HDFS
  • 54. Problem solved?
      • HDFS bottlenecks
          • Namenode scalability
          • Cost
      • S3 is an object store
          • Eventual consistency
          • Metadata retrieval performance
          • Read/write performance
  • 55. Problem solved … partly
      • HDFS bottlenecks
          • Namenode scalability
          • Cost
      • S3 is an object store
          • Eventual consistency
          • Metadata retrieval performance
          • Read/write performance
  • 56. S3A+ file system
      • Cache metadata
      • Leverage S3 multipart API
      • Prefetch data for reads
  • 57. Metadata cache in MySQL

      Path | Is dir | Is empty | Length
      s3a://bucket/foo/bar/baz/data | 1 | 0 | 0
      s3a://bucket/foo/bar/baz/data/a0 | 0 | 0 | 100
      s3a://bucket/foo/bar/baz/data/a1 | 0 | 0 | 200
      s3a://bucket/foo/bar/baz/data/a2 | 0 | 0 | 300

  • 58. Metadata cache in MySQL: 30x speed up

      Path | Is dir | Is empty | Length
      s3a://bucket/foo/bar/baz/data | 1 | 0 | 0
      s3a://bucket/foo/bar/baz/data/a0 | 0 | 0 | 100
      s3a://bucket/foo/bar/baz/data/a1 | 0 | 0 | 200
      s3a://bucket/foo/bar/baz/data/a2 | 0 | 0 | 300
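The idea behind the metadata cache is that listing a directory becomes a database query instead of a series of S3 calls. The sketch below mirrors the table from the slides, but it uses an in-memory SQLite database in place of MySQL so it runs without a server. The table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL in this sketch
conn.execute(
    "CREATE TABLE metadata ("
    "path TEXT PRIMARY KEY, is_dir INTEGER, is_empty INTEGER, length INTEGER)"
)
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?, ?)",
    [
        ("s3a://bucket/foo/bar/baz/data", 1, 0, 0),
        ("s3a://bucket/foo/bar/baz/data/a0", 0, 0, 100),
        ("s3a://bucket/foo/bar/baz/data/a1", 0, 0, 200),
        ("s3a://bucket/foo/bar/baz/data/a2", 0, 0, 300),
    ],
)


def list_status(directory: str) -> list:
    """Return (path, length) for every cached entry under `directory`."""
    prefix = directory.rstrip("/") + "/"
    cursor = conn.execute(
        "SELECT path, length FROM metadata "
        "WHERE substr(path, 1, ?) = ? ORDER BY path",
        (len(prefix), prefix),
    )
    return cursor.fetchall()


listing = list_status("s3a://bucket/foo/bar/baz/data")
```

A production cache would also have to be kept in sync on every write and delete, which this sketch does not cover.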
  • 59. S3 multipart API
      • Improved throughput
      • Quick recovery from any network issues
      • Pause and resume object uploads
      • Begin an upload before you know the final object size
  • 60. S3 multipart API
      • Improved throughput
      • Quick recovery from any network issues
      • Pause and resume object uploads
      • Begin an upload before you know the final object size
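A multipart upload starts by splitting the object into numbered byte ranges that can be sent, retried, or resumed independently. The sketch below plans those ranges under the S3 limits of a 5 MiB minimum part size (except the last part) and at most 10,000 parts. The 8 MiB default and the doubling strategy are assumptions for illustration.

```python
MIN_PART_BYTES = 5 * 1024 * 1024   # S3 minimum part size (last part may be smaller)
MAX_PARTS = 10_000                 # S3 maximum number of parts per upload


def plan_parts(object_size: int, part_size: int = 8 * 1024 * 1024) -> list:
    """Return (part_number, first_byte, last_byte) tuples covering the object."""
    if part_size < MIN_PART_BYTES:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")
    # Grow the part size until the object fits within the part-count limit.
    while -(-object_size // part_size) > MAX_PARTS:
        part_size *= 2
    parts, offset, number = [], 0, 1
    while offset < object_size:
        end = min(offset + part_size, object_size)
        parts.append((number, offset, end - 1))
        offset, number = end, number + 1
    return parts


MIB = 1024 * 1024
parts = plan_parts(20 * MIB)
```

Each tuple corresponds to one part upload. A failed part can be retried alone, which is where the quick recovery and pause/resume benefits come from.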
  • 61. S3A multipart API
  • 62. Read prefetch
  • 63. Performance
      [Chart: Hive Query Performance; x-axis: Hive queries 1-6; y-axis: latency (seconds), 0-12,000; series: S3A latency, S3A+ latency]
  • 64. Putting it all together
  • 65. To summarize
      ✓ Always store a copy of the raw input
      ✓ Use automation with S3 events to enable trigger-based workflows
      ✓ Implement the right security controls
      ✓ Use a format that supports your data, rather than forcing your data into the format
      ✓ Partition data to improve performance
      ✓ Apply compression to lower network load and cost
  • 66. New storage training
      For Enterprise Storage Engineers:
      • Learn how to architect and manage highly available solutions on AWS storage services
      • Advance toward AWS certifications
      • Help your organization migrate to the cloud faster
      Online at www.aws.training:
      • Access 100+ new digital training courses including advanced training on storage
      • Deep dives on Amazon S3, EFS, and EBS
      • Migrating and tiering storage to AWS (hybrid solutions)
      At re:Invent:
      • Visit Hands-on Labs at the Venetian
      • Attend a proctored "Introduction to EFS" Spotlight Lab on Thursday at 3pm at the Venetian
      • Meet storage experts at the Ask the Experts in Hands-on Labs room at the Venetian
  • 67. Q&A (Amazon S3, Amazon Glacier)
  • 68. Thank You!