5. Definition
“A data lake provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs”
- Wikipedia
6. Characteristics of a Data Lake
• Collect Everything
• Dive in Anywhere
• Flexible Access
14. New Business Outcomes and Capabilities
• Enable New Insights in Your Data
• Cost Savings of Compute and Storage
• Use the Right Tool for the Job
• Increase Durability of Data
• Charge Storage Costs to Owner
• Streaming and Real-time Analysis
Retain all your data, for years!
19. Requirements for Storage
• Multi-year Scalable Storage Capability
• High Durability
• Store Raw Data from Any Input Sources
• Support for Any Data Type
• Low Cost
22. Recommendations #1
• S3 Buckets
• Close to Users and Compute
• Select Region for Regulatory Compliance
• Naming
• Human-readable Path
• Random Hash Prefix for Optimal Partitioning
• Format
• Structured vs Unstructured + Compression
• CSV, Parquet, ORC, Avro, JSON, XML, logs, etc.
• GZIP for small files, LZO, Snappy
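The naming recommendations above (a human-readable path plus a short hash prefix) can be sketched as follows. This is a minimal illustration, not an AWS-prescribed scheme: the function name `build_object_key`, the `source/YYYY/MM/DD/filename` layout and the 4-character prefix length are all assumptions made for the example.

```python
import hashlib
from datetime import date


def build_object_key(source: str, day: date, filename: str, prefix_len: int = 4) -> str:
    """Return a key shaped like '<hash>/<source>/YYYY/MM/DD/<filename>'.

    The layout and prefix length are illustrative assumptions.
    """
    # Human-readable part: easy to browse and to filter by date.
    readable = f"{source}/{day:%Y/%m/%d}/{filename}"
    # Derive the prefix from the readable path so the same input always
    # maps to the same key (a reproducible "random" prefix).
    digest = hashlib.sha256(readable.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{readable}"


if __name__ == "__main__":
    print(build_object_key("clickstream", date(2017, 4, 5), "events.json.gz"))
```

Deriving the prefix from the path, rather than from a random generator, keeps the key recomputable from the metadata alone.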
23. Recommendations #2
• Optimise
• Store Everything
• Use Large Files with Splittable Format
• Lifecycle Policies for Cost-savings
• Tagging for Cost Allocation
• Security
• Encryption
• Bucket Policies, ACL, Tagging, CloudTrail
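As one possible illustration of the lifecycle recommendation above, the sketch below builds a lifecycle configuration in the dictionary shape that boto3's `put_bucket_lifecycle_configuration` expects. The helper name `lifecycle_rules`, the `raw/` prefix, the bucket name in the comment and the 30/365-day thresholds are assumptions chosen for the example.

```python
def lifecycle_rules(prefix: str, ia_after_days: int = 30, glacier_after_days: int = 365) -> dict:
    """Build a lifecycle configuration that tiers objects under `prefix`.

    Thresholds are illustrative; choose them from your own access patterns.
    """
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    name = prefix.strip("/").replace("/", "-") or "all"
    return {
        "Rules": [
            {
                "ID": f"tier-{name}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    # Infrequently accessed data moves to a cheaper class first...
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    # ...and to archival storage later.
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


# Applying it would look roughly like this (bucket name is hypothetical):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake",
#       LifecycleConfiguration=lifecycle_rules("raw/"),
#   )
```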
24. Requirements for Ingestion
• Batch File Support
• Traditional ETL
• Streaming Data
• Consumption of any Dataset as a Stream
• Low Latency Analytics
• Replay-ability from the Data Lake
• Server-less ETL Capabilities
25. Key Services for Ingestion
Amazon Kinesis Firehose
1. Easy to use with Agent
2. Automatic Elasticity
3. Near Real-time
4. Simultaneous Destinations
Amazon Kinesis Streams
1. Enables Custom Processing
2. Continuous Data Collection
3. Real-time
4. API Driven for Custom Apps
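A rough sketch of API-driven ingestion into Firehose follows. It assumes a client object exposing boto3's Firehose `put_record_batch` call; the function names, the newline-delimited JSON encoding and the delivery stream name are assumptions for illustration. It batches by record count only and does not handle the per-call size limit or retries.

```python
import json
from typing import Any, Iterable

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_record(event: dict) -> dict:
    # Newline-delimited JSON keeps the objects landed in S3 easy to split.
    return {"Data": (json.dumps(event) + "\n").encode("utf-8")}


def send_events(client: Any, stream_name: str, events: Iterable[dict]) -> int:
    """Send events in batches; return how many records the service rejected."""
    records = [to_record(event) for event in events]
    failed = 0
    for start in range(0, len(records), MAX_BATCH):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[start:start + MAX_BATCH],
        )
        failed += response.get("FailedPutCount", 0)
    return failed


# Typical use (stream name is hypothetical):
#   import boto3
#   send_events(boto3.client("firehose"), "example-delivery-stream", events)
```

Passing the client in as a parameter keeps the batching logic testable without AWS credentials.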
28. Recommendations
• Reminder
• Added Complexity needs Business Justification
• Select the Right Tools
• Real-time Analysis: Apache Spark Streaming, Storm, Flink
• Firehose to Redshift for BI and Dashboards
• Tips
• AWS Lambda for ETL Transformation
• Persist Streams into S3
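The "AWS Lambda for ETL Transformation" tip can be illustrated with a Firehose data-transformation handler. The record envelope (`recordId`, base64 `data`, and a `result` of `Ok` or `ProcessingFailed`) follows the Firehose transformation contract; the transformation itself (lower-casing field names) is only a stand-in chosen for the example.

```python
import base64
import json


def handler(event, context):
    """Lower-case the field names of each JSON record passing through Firehose."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            cleaned = {key.lower(): value for key, value in payload.items()}
            encoded = base64.b64encode((json.dumps(cleaned) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("ascii"),
            })
        except (ValueError, AttributeError):
            # Not valid base64/JSON, or not a JSON object: hand the record
            # back unchanged and flag it so Firehose can route it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```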
32. Requirements for Catalogue and Search
• Metadata Index
• Automated Metadata Processing
• Discovery and Search
• Data Classification
• Server-less and Event-driven
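One way to picture event-driven, automated metadata processing is a Lambda function triggered by S3 object-created notifications. The sketch below only extracts an index entry per object from the S3 event; where the entries are stored (for example a database table or a search cluster) is left open, and the output field names are assumptions.

```python
from urllib.parse import unquote_plus


def extract_metadata(event: dict) -> list:
    """Turn an S3 notification event into a list of catalogue entries."""
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(s3["object"]["key"])
        items.append({
            "bucket": s3["bucket"]["name"],
            "key": key,
            "size_bytes": s3["object"].get("size", 0),
            "etag": s3["object"].get("eTag"),
            "event_time": record.get("eventTime"),
            # Crude format guess from the file extension (illustrative only).
            "format": key.rsplit(".", 1)[-1].lower() if "." in key else "unknown",
        })
    return items


def handler(event, context):
    entries = extract_metadata(event)
    # A real function would write `entries` to the metadata index here.
    return {"indexed": len(entries)}
```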
42. Recommendations
• Start Early
• Security Needs Practice!
• Federate with your Corporate Directory
• Best Practice
• Use CloudTrail and CloudWatch
• Encrypt Where Possible
• Select Bucket Region for Regulatory Compliance
• Tips
• IAM Policies, S3 Versioning and MFA Delete
• Lambda for Data Masking
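The "Lambda for Data Masking" tip could look roughly like the following. The masking rules (replace named sensitive fields outright, partially mask e-mail addresses found in other string fields) and all function and field names are assumptions for illustration; real masking requirements depend on your data classification.

```python
import re

# First character of the local part is kept, the rest is masked.
EMAIL = re.compile(r"([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def mask_emails(text: str) -> str:
    return EMAIL.sub(r"\1***@\2", text)


def mask_record(record: dict, sensitive_fields: set) -> dict:
    """Return a copy of `record` that is safer to serve to consumers."""
    masked = {}
    for field, value in record.items():
        if field in sensitive_fields:
            masked[field] = "****"
        elif isinstance(value, str):
            masked[field] = mask_emails(value)
        else:
            masked[field] = value
    return masked


def handler(event, context):
    # `records` and `sensitive_fields` are hypothetical event fields.
    fields = set(event.get("sensitive_fields", []))
    return [mask_record(record, fields) for record in event.get("records", [])]
```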
43. API and UI
• Storage and Ingestion
• Catalogue and Search
• Security
• API and UI
44. Requirements for API and UI
• Serve Data and Capabilities to Customers
• Programmatically
• Search Catalogue
• Run Compute
• Extend Access Control Management
• And… Use of Familiar Visualisation Tools
45. Key Services for API and UI
Amazon API Gateway
1. Performance at Any Scale
2. Create RESTful Frontend
3. Managed API Lifecycle
AWS Lambda
1. Enables Server-less API
2. Custom Logic for Services
3. Automatic Scaling
47. Recommendations
• Tips
• Go Server-less!
• Extend Existing AWS Services and Build Custom Logic
• Data Management, Processing and Transformations
• API Gateway for Data Access
• Serve the Data, Search and Compute via RESTful APIs
• Distribute a Custom SDK
• Extend the Solution
• Build Advanced Security Controls using Metadata Index
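Serving catalogue search through a RESTful API can be sketched as a Lambda function behind an API Gateway proxy integration. The request and response envelope (`queryStringParameters`, `statusCode`, `headers`, `body`) follows the proxy integration format; the in-memory `CATALOGUE`, its entries and the `q` parameter are hypothetical stand-ins for a real metadata index.

```python
import json

# Hypothetical stand-in for the metadata index.
CATALOGUE = [
    {"key": "raw/clickstream/2017/04/05/events.json.gz", "tags": ["clickstream", "raw"]},
    {"key": "curated/sales/2017/q1.parquet", "tags": ["sales", "curated"]},
]


def _response(status: int, payload: dict) -> dict:
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }


def handler(event, context):
    # API Gateway sends null when the request has no query string.
    params = event.get("queryStringParameters") or {}
    term = params.get("q", "").strip().lower()
    if not term:
        return _response(400, {"error": "missing query parameter 'q'"})
    matches = [
        entry for entry in CATALOGUE
        if term in entry["key"].lower() or term in entry["tags"]
    ]
    return _response(200, {"count": len(matches), "results": matches})
```

A request such as `GET /search?q=sales` would then return the matching catalogue entries as JSON.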
48. The Whole Picture…
• Storage and Ingestion
• Catalogue and Search
• Security
• API and UI
50. A Data Lake is…
• Foundation of Data Storage and Streaming Data
• Metadata index to help Categorise and Govern
• Search Index to Enable Data Discovery
• Robust Set of Security Controls
• Governance Through Technology Not Policy
• Interface to Expose Data and Capabilities to Users
77. Next Steps
• How to Get Started
• AWS Documentation
• Getting Started Guide
• AWS Training & Certification
• Big Data on AWS
• AWS Partner Network
• AWS Professional Services
• Big Data Specialists
78. AWS Training & Certification
• Intro Videos & Labs: Free videos and labs to help you learn to work with 30+ AWS services in minutes!
• Training Classes: In-person and online courses to build technical skills, taught by accredited AWS instructors
• Online Labs: Practice working with AWS services in a live environment and learn how related services work together
• AWS Certification: Validate technical skills and expertise, identify qualified IT talent or show you are AWS cloud ready
Learn more: aws.amazon.com/training
79. Your Training Next Steps:
• Visit the AWS Training & Certification pod to discuss your training plan & AWS Summit training offer
• Register for & attend AWS instructor-led training
• Get Certified
• AWS Certified? Visit the AWS Summit Certification Lounge to pick up your swag