Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with Special Guests, Airbnb & Viber - STG312 - re:Invent 2017
1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Best Practices for Building a Data Lake
on Amazon S3 & Amazon Glacier
with special guests: Viber and Airbnb
John Mallory, Business Development, Storage
PD Dutta, Sr. Product Manager, Amazon S3
STG312
November 30, 2017
2.
What to expect
• Defining the AWS data lake on Amazon S3 and Amazon Glacier
• Data cataloging
• Security, performance, and analytics best practices
• Special guests Viber and Airbnb
3.
Defining the AWS data lake
A data lake is an architecture with a virtually limitless centralized storage platform capable of categorization, processing, analysis, and consumption of heterogeneous data sets.
Key data lake attributes
• Decoupled storage and compute
• Rapid ingest and transformation
• Secure multi-tenancy
• Query in place
• Schema on read
4.
What can you do with a data lake?
Storage: Amazon S3 and Amazon Glacier
• Data warehouse: Amazon Redshift
• Batch processing (Hadoop/Hive/Presto): Amazon EMR
• Clusterless SQL query: Amazon Athena
• Clusterless ETL: AWS Glue
• BI & visualization
5.
What can you do with a data lake?
Storage: Amazon S3 and Amazon Glacier
Streaming and real-time analytics
• AWS Lambda
• Amazon Elasticsearch Service
• Apache Storm on EMR
• Apache Flink on EMR
• Amazon Kinesis Analytics
• Spark Streaming on EMR
• Amazon ElastiCache
6.
What can you do with a data lake?
Storage: Amazon S3 and Amazon Glacier
AI and machine learning
• Amazon Polly: life-like speech
• Amazon Lex: conversational engine
• Amazon Rekognition: image analysis
• Deep learning frameworks: MXNet, TensorFlow, Theano, Caffe, Torch
7.
Reasons to choose Amazon S3 for data lake
• Unmatched durability, availability, and scalability
• Best security, compliance, and audit capability
• Object-level control at any scale
• Business insight into your data
• Twice as many partner integrations
• Most ways to bring data in
8.
Optimize costs with data tiering
Tiers, from hot to cold: HDFS, Amazon S3 Standard, Amazon S3 Standard-Infrequent Access, Amazon Glacier
• Use EMR/Hadoop with local HDFS for hottest data sets
• Store cooler data in S3 and Glacier to reduce costs
• Use S3 Analytics to optimize tiering strategy
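Tiering between S3 storage classes is commonly automated with a lifecycle configuration. The following is a minimal sketch, not something shown in the talk: the `logs/` prefix, the bucket name in the comment, and the 30-day and 90-day thresholds are illustrative assumptions. The dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`.

```python
import json


def tiering_rule(prefix, ia_after_days, glacier_after_days):
    """Build one lifecycle rule: Standard -> Standard-IA -> Glacier."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


# Hypothetical prefix and thresholds; S3 Analytics output would inform real values.
lifecycle = {"Rules": [tiering_rule("logs/", 30, 90)]}
print(json.dumps(lifecycle, indent=2))

# With boto3 installed and credentials configured, it could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=lifecycle)
```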
9.
Multiple data lake ingestion methods
AWS Snowball and AWS Snowmobile
• PB-scale migration
AWS Storage Gateway
• Migrate legacy files
Native/ISV Connectors
• Ecosystem integration
Amazon S3 Transfer Acceleration
• Long-distance data transfer
AWS Direct Connect
• On-premises integration
Amazon Kinesis Firehose
• Ingest device streams
• Transform and store on
Amazon S3
10.
Catalog your S3 data
• Amazon S3 emits ObjectCreated and ObjectDeleted events
• AWS Lambda writes object metadata with PutItem to the metadata index (DynamoDB)
• The DynamoDB update stream triggers a second AWS Lambda function
• That function extracts search fields and updates the search index (Amazon Elasticsearch Service or Amazon CloudSearch)
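The first Lambda step in this catalog flow can be sketched as a pure function that turns an S3 event notification into index actions. This is an illustration under assumptions: the function name, the item fields, and the sample event are hypothetical, and the actual DynamoDB calls are left out so the sketch runs without AWS access. Note that S3 reports deletions with event names beginning `ObjectRemoved`.

```python
from urllib.parse import unquote_plus


def metadata_actions(event):
    """Turn an S3 event notification into ("put" | "delete", item) tuples.

    A real Lambda function would pass these to DynamoDB PutItem / DeleteItem.
    """
    actions = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        item = {
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
        }
        if record["eventName"].startswith("ObjectCreated"):
            item["size"] = s3["object"].get("size", 0)
            item["event_time"] = record.get("eventTime")
            actions.append(("put", item))
        elif record["eventName"].startswith("ObjectRemoved"):
            actions.append(("delete", item))
    return actions


# Hypothetical sample event with one upload and one deletion.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2017-11-30T00:00:00.000Z",
            "s3": {
                "bucket": {"name": "examplebucket"},
                "object": {"key": "photos/my+photo.jpg", "size": 1024},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "examplebucket"},
                "object": {"key": "photos/old.jpg"},
            },
        },
    ]
}

for action in metadata_actions(sample_event):
    print(action)
```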
11.
AWS Glue analytics data catalog
12.
AWS Glue analytics data catalog
Manage table metadata through a Hive metastore API or Hive SQL. Supported by tools like Hive, Presto, Spark, etc.
We added a few extensions:
• Search over metadata for data discovery
• Connection info: JDBC URLs, credentials
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve and other metadata are updated
Populate using Hive DDL, bulk import, or automatically through crawlers.
13.
Populating the AWS Glue data catalog
Automatically discover new data and extract schema definitions
• Detect schema changes and version tables
• Detect Hive-style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run via Lambda triggers or on a schedule; serverless, so you pay only when a crawler runs
Crawlers automatically build your data catalog and keep it in sync
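Hive-style partitions are path segments of the form `name=value`. The sketch below shows what detecting them in an S3 key can look like; it illustrates the naming convention only and makes no claim about how Glue crawlers are implemented. The function name and sample key are hypothetical.

```python
def hive_partitions(key):
    """Extract Hive-style partition columns (name=value segments) from an S3 key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


print(hive_partitions("events/year=2017/month=11/day=30/part-0000.parquet"))
```

Keys without `name=value` segments simply yield an empty dictionary, so the same function can be run over a mixed listing.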
14.
Securing your data on Amazon S3
15.
AWS data lake security entitlements
Encryption
• Default encryption
• Server-side encryption
• Client-side encryption
• SSL endpoints
• Encryption status in inventory
• CRR with KMS
Identity and access
• Amazon Macie
• Permission checks
• AWS Config Rules
• IAM & bucket policies
• Access control lists
Compliance
• Certifications: HIPAA, FedRAMP, PCI-DSS
• CloudHSM integration
• Versioning & MFA deletes
• Audit logging
16.
Security: Access to encryption keys
Diagram labels: IAM, Security Token Service, temporary credentials, customer master key, customer data keys (ciphertext key, plaintext key)
17.
Security: Access to encryption keys
Diagram labels: IAM, Security Token Service, temporary credentials, customer master key, customer data keys (ciphertext key, plaintext key), Amazon S3, S3 object (Name: MyData, Key: Ciphertext Key), My Data
18.
Security for your data lake (Pro Tip)
• IAM best practices
• SSL/TLS connections
• Server-side encryption
• Bucket policies
• Versioning; recycle bin
• MFA deletes
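Two of these items, SSL/TLS connections and server-side encryption, can be enforced through a bucket policy. The following is a hedged sketch rather than a policy from the talk: the bucket name and statement IDs are hypothetical, and the conditions use the standard `aws:SecureTransport` and `s3:x-amz-server-side-encryption` condition keys.

```python
import json


def data_lake_bucket_policy(bucket):
    """Build a bucket policy that denies non-TLS requests and unencrypted uploads."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that does not use SSL/TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that omit the server-side encryption header.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            },
        ],
    }


policy = data_lake_bucket_policy("examplebucket")  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```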
19.
Optimizing performance on Amazon S3
20.
Getting high throughput with Amazon S3
• Most customers need not worry about introducing entropy in key names
• Consider 3-4 character hash for higher requests per second
• Random hash should come before patterns such as dates and sequential IDs
• Always first ensure that your application can accommodate
examplebucket/232a-2017-26-05-15-00-00/cust1234234/photo1.jpg
examplebucket/7b54-2017-26-05-15-00-00/cust3857422/photo2.jpg
examplebucket/921c-2017-26-05-15-00-00/cust1248473/photo2.jpg
A bit more LIST friendly:
examplebucket/animations/232a-2017-26-05-15-00-00/cust1234234/animation1.obj
examplebucket/videos/ba65-2017-26-05-15-00-00/cust8474937/video2.mpg
examplebucket/photos/8761-2017-26-05-15-00-00/cust1248473/photo3.jpg
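A short hash prefix like the ones in the example keys can be derived from the key itself, which keeps the mapping deterministic. This is a sketch under assumptions: the talk does not say how the hashes were produced, so SHA-256 and the helper names here are illustrative choices.

```python
import hashlib


def hashed_key(natural_key, hash_chars=4):
    """Prepend a short hex hash so keys do not all share a date or ID prefix."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[:hash_chars]}-{natural_key}"


def list_friendly_key(category, natural_key, hash_chars=4):
    """Keep a fixed category first so LIST by category still works."""
    return f"{category}/{hashed_key(natural_key, hash_chars)}"


natural = "2017-26-05-15-00-00/cust1234234/photo1.jpg"
print(hashed_key(natural))
print(list_friendly_key("photos", natural))
```

Because the hash is computed from the natural key, a reader that knows the natural key can recompute the full object key without a lookup table.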
21.
Optimizing data lake performance (Pro Tip)
• Aggregate small files: EMR S3DistCp, Amazon Kinesis Firehose
• S3 Select: big data cheaper, faster; up to 400% faster
• Data formats: columnar formats
• EMRFS consistent view: Amazon S3 with Amazon DynamoDB
22.
Big data analytics & query in place
23.
Amazon analytics end-to-end architecture
Diagram labels: Amazon S3, Data Catalog, Athena, EMR, Amazon Redshift Spectrum, Amazon ML/MXNet, RDS, Amazon QuickSight, Kinesis, Database Migration Service, AWS Glue, IAM, other sources
24.
Introducing Amazon S3 Select
• Simple API to retrieve a subset of data based on a SQL expression
• Accelerate performance for data retrieval and processing by up to 400%
• Simplify compute by retrieving a subset of data in a common format
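An S3 Select request pairs an object with a SQL expression and input/output serialization settings. The sketch below only builds the request arguments, in the shape taken by boto3's `select_object_content`; the bucket, key, column names, and the assumption of a CSV object with a header row are all hypothetical.

```python
def select_request(bucket, key, sql):
    """Keyword arguments for boto3's S3 client method select_object_content."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # Assumes a CSV object whose first row holds column names.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        "OutputSerialization": {"CSV": {}},
    }


request = select_request(
    "examplebucket",
    "logs/2017/11/30/events.csv",
    "SELECT s.user_id, s.event FROM S3Object s WHERE s.event = 'login'",
)
print(request["Expression"])

# With boto3 installed and credentials configured:
#   response = boto3.client("s3").select_object_content(**request)
#   for event in response["Payload"]:
#       if "Records" in event:
#           print(event["Records"]["Payload"].decode("utf-8"))
```

Only the rows and columns matching the expression travel back to the caller, which is where the reduced transfer and processing cost comes from.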
25.
Amazon EMR: Decouple compute & storage (Pro Tip)
• Highly distributed processing frameworks such as Hadoop/Spark
• Compress datasets
• Columnar file formats
• Aggregate small files
• S3DistCp “group-by” clause
26.
Amazon Redshift Spectrum: Exabyte-scale query in place (Pro Tip)
• Structured data w/ joins
• Multiple on-demand clusters to scale concurrency
• Columnar file formats
• Data partitioning
• Better query performance with predicate pushdown
27.
Amazon Athena: Query without ETL (Pro Tip)
• Serverless service
• Schema on read
• Compress datasets
• Columnar file formats
• Optimize file sizes
• Optimize querying (Presto backend)
28.
Use the right data formats
• Pay by the amount of data scanned per query
• Use compressed columnar formats
• Parquet
• ORC
• Easy to integrate with wide variety of tools
Dataset | Size on Amazon S3 | Query run time | Data scanned | Cost
Logs stored as text files | 1 TB | 237 seconds | 1.15 TB | $5.75
Logs stored in Apache Parquet format* | 130 GB | 5.13 seconds | 2.69 GB | $0.013
Savings | 87% less with Parquet | 34x faster | 99% less data scanned | 99.7% cheaper
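The cost column follows from the pay-per-data-scanned model. A small worked sketch, assuming a price of $5 per TB scanned (the rate implied by $5.75 for 1.15 TB) and 1 TB = 1024 GB; both are assumptions read off the table rather than stated in the talk. The run times as listed give a ratio of roughly 46x (237 / 5.13), so the 34x figure on the slide may come from a different run.

```python
PRICE_PER_TB_SCANNED = 5.00  # USD, assumed: implied by $5.75 / 1.15 TB


def query_cost(tb_scanned):
    """Cost of one query under a pay-per-data-scanned model."""
    return tb_scanned * PRICE_PER_TB_SCANNED


text_tb = 1.15
parquet_tb = 2.69 / 1024  # 2.69 GB expressed in TB

text_cost = query_cost(text_tb)
parquet_cost = query_cost(parquet_tb)

print(f"text files: ${text_cost:.2f}")
print(f"Parquet:    ${parquet_cost:.3f}")
print(f"cost saving: {1 - parquet_cost / text_cost:.2%}")
print(f"run-time ratio from listed times: {237 / 5.13:.1f}x")
```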
29.
Viber data lake
Amir Ish-Shalom
Chief Architect, Viber
Special guest: Viber
30.
• Messaging (including group)
• Secure end-to-end encryption
• Rich media and chat extensions
• Full multiple device support
• HD video and voice calls
• Viber out and Viber in
• Public chats and accounts
• Chatbots
31.
32.
Big data @ Viber
• Close to 1 billion users worldwide
• Globally used in 230 countries
• 10-15 billion events daily (2 TB)
• 300,000 events per second (peak hours)
• 5 PB of data stored on Amazon S3/Amazon Glacier
• NoSQL DB (Couchbase) performing
2 million TPS on 20 TB of data
with 35 billion keys
33.
Viber: big data architecture
• Viber backend servers (source of events)
• RT data pipeline: Kinesis
• RT data processor: Apache Storm
• Raw data backup: Kinesis Firehose
• Data lake: Amazon S3
• ETL jobs: Spark, Presto, Pig, Lambda functions
• Query engines: Presto, Athena, Spark SQL, Pig
• Reporting tools: Tableau, Redash, Zeppelin, others
• Databases & data warehouses: Amazon Redshift, Aurora, MySQL
• NoSQL profile database: Couchbase
34.
Data lake challenges
Use case #1: S3 performance
Use case #2: Data access rights
Use case #3: Encrypted data storage
Use case #4: Storage of data from third parties
35.
Use case: S3 performance
Challenge:
• Over 300 different event types with large throughput variance
• Storm data processor created many small files, especially for lower
throughput events
• Events were stored in Hive partitioned folders (Y/M/D/H), which
are not optimal for Amazon S3
• Running a query over these events using Presto could generate up
to 15K tps on a single S3 bucket, resulting in 5xx errors and
throttling the whole bucket for other processes
36.
Use case: S3 performance
Solution:
• Concatenate small files into large files, optimally 100 MB+
• Convert files into a columnar file format such as Parquet or ORC
Pipeline: S3 → S3DistCp (concatenate files) → S3 → Spark (convert to Parquet) → S3
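Deciding which small files to merge is a simple grouping problem. The sketch below plans batches that each reach the 100 MB target; it covers only the planning step and is not how S3DistCp works internally. The function name, the greedy strategy, and the sample file list are illustrative assumptions.

```python
TARGET_BYTES = 100 * 1024 * 1024  # the 100 MB minimum suggested in the talk


def plan_batches(files, target_bytes=TARGET_BYTES):
    """Greedily group (key, size) pairs into batches of at least target_bytes.

    The final batch may be smaller if the remaining files do not add up.
    """
    batches, current, current_size = [], [], 0
    for key, size in files:
        current.append(key)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)
    return batches


# Hypothetical listing: ten 30 MB files for one low-throughput event type.
small_files = [
    (f"events/login/part-{i:04d}.json", 30 * 1024 * 1024) for i in range(10)
]
for batch in plan_batches(small_files):
    print(len(batch), "files:", batch[0], "...", batch[-1])
```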
37.
Use case: S3 performance
Future solution:
• Concatenate and convert files in a single process (Glue?)
• Use a better-partitioned Hive directory format (H/D/M/Y instead of Y/M/D/H)
• Use even larger files for high-throughput events
Pipeline: S3 → Spark/AWS Glue (concatenate files & convert to Parquet) → S3
38.
Use case: Data access rights
Challenge:
• Events can contain sensitive personal data
• Allow access to events without exposing personal data
39.
Use case: Data access rights
Solution #1:
• Create separate Hive metastores for full and redacted data
• Reporting tools will select relevant metastore based on current user
Diagram labels: event definitions (Aurora RDS), full-access Hive metastore, redacted-access Hive metastore
40.
Use case: Data access rights
Solution #2:
• Anonymize sensitive personal data
• Legal compliance issues
• Limits data science capabilities
Pipeline: RT data pipeline (Kinesis) → RT data processor (Apache Storm) → data lake (S3), holding raw data and anonymized data
41.
Use case: Encrypted data storage
Challenge:
• Store daily backups in S3
• Backup must be encrypted
• Strict access control
• Regional replication
• Complex data retention
42.
Use case: Encrypted data storage
Solution:
• Security: encrypt using SSE-KMS
• Access: require permissions to both S3 bucket & KMS key
• Tagging: use S3 object-level tagging to apply different lifecycle policies to certain objects
• Multiple regions: use CRR-KMS
Pipeline: Viber back-end servers → (SSE-KMS) → local region data storage (Amazon S3) → cross-region replication (CRR-KMS) → remote regional data storage (Amazon S3)
43.
Use case: Storage of data from third parties
Challenge:
• Securely store data from third parties in data lake
• Validate data before storing
• Allow optional data transformation
44.
Use case: Storage of data from third parties
Solution:
• Third-party access to an external S3 bucket via access keys
• Lambda validates the data, performs transformations, and uploads it using KMS to another S3 bucket in the Viber data lake
Pipeline: third party → S3 (external bucket) → Lambda → (SSE-KMS) → S3 data lake
45.
Viber data lake—summary
46.
Airbnb: tiered storage system
Hongbo Zeng
Software Engineer, Airbnb
Special guest: Airbnb
47.
Agenda
• Challenges
• Tiered storage system
• S3A+ file system
48.
Motivations
• 3x+ YoY data growth
• HDFS bottlenecks
  • Namenode scalability
  • Cost
• S3 is an object store
  • Eventual consistency
  • Metadata retrieval performance
  • Read/write performance
49.
Tiered storage system
• HDFS + S3
  • Hot data on HDFS
  • Warm and cold data on S3
• Bring the best of both together
  • Performance
  • Scalability
  • Cost
Diagram labels: clients (Hive, Presto, Spark, etc.), HDFS cluster, S3
50.
Architecture
Diagram labels: UI, archive policy, retention policy, FSImage / Metastore pipeline, storage processor, data backup, data archive, data retention, HDFS/Hive cluster, S3
51.
Backup
DB | Table | Part
foo | bar | baz
Stages (each runs in reducers, with the files split between them as a0-a1 and a2-a3):
• Generate paths for files: dest paths for foo/bar/baz/a0 to a3
• Copy files: copy foo/bar/baz/a0 to a3
• Generate and compare CRC: CRC for foo/bar/baz/a0 to a3
• Update the metadata for partitions: tags for foo/bar/baz
Src | Dest
foo/bar/baz/a0 | s3://bkt/foo/bar/baz/data/a0
foo/bar/baz/a1 | s3://bkt/foo/bar/baz/data/a1
foo/bar/baz/a2 | s3://bkt/foo/bar/baz/data/a2
foo/bar/baz/a3 | s3://bkt/foo/bar/baz/data/a3
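The path-generation and CRC steps of this backup flow can be sketched in a few lines. The destination layout follows the Src/Dest mapping on the slide (a `data/` directory inserted before the file name); the function names are hypothetical, and CRC32 from `zlib` is an assumed stand-in, since the talk does not say which CRC variant is compared.

```python
import zlib


def backup_destination(src_path, bucket="bkt"):
    """Map an HDFS file path to its S3 backup path, as in the slide's mapping."""
    directory, _, filename = src_path.rpartition("/")
    return f"s3://{bucket}/{directory}/data/{filename}"


def crc_matches(src_bytes, dest_bytes):
    """Compare CRC32 checksums of the source data and the copied data."""
    return zlib.crc32(src_bytes) == zlib.crc32(dest_bytes)


for name in ("a0", "a1", "a2", "a3"):
    print(f"foo/bar/baz/{name}", "->", backup_destination(f"foo/bar/baz/{name}"))

print(crc_matches(b"partition data", b"partition data"))
```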
52.
Archive
• Metadata validation
  • Is there a successful backup?
  • Is the backup location valid?
  • Anything changed since the latest backup?
• Data validation
  • File count
  • File size
• Archive
  • Update the location of partitions
53.
The journey of a partition
• 2017-11-30: generate a partition foo.db/bar/ds=11-30
• 2017-11-30: back up the partition to s3://bkt/foo.db/bar/ds=11-30
• 2017-12-30: archive the partition foo.db/bar/ds=11-30
• 2017-12-31: delete the data from HDFS
54.
Problem solved?
• HDFS bottlenecks
  • Namenode scalability
  • Cost
• S3 is an object store
  • Eventual consistency
  • Metadata retrieval performance
  • Read/write performance
55.
Problem solved … partly
• HDFS bottlenecks
  • Namenode scalability
  • Cost
• S3 is an object store
  • Eventual consistency
  • Metadata retrieval performance
  • Read/write performance
56.
S3A+ file system
• Cache metadata
• Leverage S3 multipart API
• Prefetch data for reads
57.
Metadata cache in MySQL
Path | Is dir | Is empty | Length
s3a://bucket/foo/bar/baz/data | 1 | 0 | 0
s3a://bucket/foo/bar/baz/data/a0 | 0 | 0 | 100
s3a://bucket/foo/bar/baz/data/a1 | 0 | 0 | 200
s3a://bucket/foo/bar/baz/data/a2 | 0 | 0 | 300
30x Speed Up
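The idea of answering directory listings from a database instead of from S3 can be sketched with the same four rows. This is an illustration only: SQLite from the Python standard library stands in for MySQL, and the table name, column names, and `list_status` helper are hypothetical, not the S3A+ implementation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL metadata store
conn.execute(
    """
    CREATE TABLE metadata (
        path     TEXT PRIMARY KEY,
        is_dir   INTEGER NOT NULL,
        is_empty INTEGER NOT NULL,
        length   INTEGER NOT NULL
    )
    """
)
rows = [
    ("s3a://bucket/foo/bar/baz/data", 1, 0, 0),
    ("s3a://bucket/foo/bar/baz/data/a0", 0, 0, 100),
    ("s3a://bucket/foo/bar/baz/data/a1", 0, 0, 200),
    ("s3a://bucket/foo/bar/baz/data/a2", 0, 0, 300),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?, ?)", rows)


def list_status(directory):
    """Return (path, length) for entries under a directory without calling S3."""
    prefix = directory.rstrip("/") + "/"
    cursor = conn.execute(
        "SELECT path, length FROM metadata "
        "WHERE substr(path, 1, ?) = ? ORDER BY path",
        (len(prefix), prefix),
    )
    return cursor.fetchall()


print(list_status("s3a://bucket/foo/bar/baz/data"))
```

A listing served this way costs one indexed query rather than a series of S3 LIST and HEAD requests, which is the kind of saving a metadata cache targets.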
59.
S3 multipart API
• Improved throughput
• Quick recovery from any network issues
• Pause and resume object uploads
• Begin an upload before you know the final object size
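A multipart upload splits an object into independently uploaded parts, which is what allows parallel transfer and retrying a single failed part. The sketch below plans the byte ranges only; the 64 MiB default part size and the function name are illustrative assumptions, while the 5 MiB minimum part size (for all parts except the last) and the 10,000-part limit are S3's documented constraints.

```python
MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last
MAX_PARTS = 10_000               # S3 limit on parts per multipart upload


def plan_parts(object_size, part_size=64 * 1024 * 1024):
    """Return (part_number, start, end) byte ranges, with end exclusive."""
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the S3 minimum of 5 MiB")
    parts = []
    start = 0
    while start < object_size:
        end = min(start + part_size, object_size)
        parts.append((len(parts) + 1, start, end))
        start = end
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; increase part_size")
    return parts


parts = plan_parts(200 * 1024 * 1024)  # hypothetical 200 MiB object
for number, start, end in parts:
    print(f"part {number}: bytes {start}-{end - 1}")
```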
61.
S3A multipart API
62.
Read prefetch
63.
Performance
Chart: Hive Query Performance. Latency in seconds (axis 0 to 12,000) for Hive queries 1 through 6, comparing S3A latency with S3A+ latency.
64.
Putting it all together
65.
To summarize
• Always store a copy of the raw input
• Use automation with S3 events to enable trigger-based workflows
• Implement the right security controls
• Use a format that supports your data, rather than forcing your data into the format
• Partition data to improve performance
• Apply compression to lower network load and cost
66.
New storage training
For Enterprise Storage Engineers
• Learn how to architect and manage highly available solutions on AWS storage services
• Advance toward AWS certifications
• Help your organization migrate to the cloud faster
Online at www.aws.training
• Access 100+ new digital training courses including advanced training on storage
• Deep dives on Amazon S3, EFS, and EBS
• Migrating and tiering storage to AWS (hybrid solutions)
At re:Invent
• Visit Hands-on Labs at the Venetian
• Attend a proctored “Introduction to EFS” Spotlight Lab on Thursday at 3pm at the Venetian
• Meet storage experts at the Ask the Experts in Hands-on Labs room at the Venetian
67.
Q&A
Amazon S3 Amazon Glacier
68.
Thank You!