Data lakes let organizations store all types of data at scale in a centralized repository. AWS Lake Formation simplifies building secure data lakes by registering storage, ingesting and cleaning data, enforcing access permissions, and making data available for analytics. Data in the lake can then be analyzed with services such as Amazon Athena, Amazon Redshift, and Amazon EMR, depending on the type of analysis and the latency required.
7. There is more data than people think
Data grows >10x every 5 years. Data platforms need to scale 1,000x and live for 15 years.
8. There are more people accessing data
And more requirements for making data available.
Consumers: data scientists, analysts, business users, and applications.
Requirements: secure, real-time, flexible, and scalable.
14. Analytics used to look like this
OLTP / ERP / CRM / LOB sources feed a data warehouse, which feeds business intelligence:
- Relational data
- TBs–PBs scale
- Schema defined prior to data load
- Operational reporting and ad hoc queries
- Large initial CAPEX + $10K–$50K/TB/year
15. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
16. Why data lakes?
Data lakes provide:
- Relational and non-relational data
- Scale-out to exabytes
- A diverse set of analytics and machine learning tools
- The ability to work on data without any data movement
- Low-cost storage and analytics by design
[Diagram: OLTP/ERP/CRM/LOB systems and devices/web/sensors/social sources feed both a data warehouse (for business intelligence) and a data lake with a catalog, which serves machine learning, DW queries, big data processing, interactive, and real-time analytics.]
17. Data Stores: What’s the Difference?
• OLTP (Online Transaction Processing)
Characterized by a large number of short transactions (INSERT, UPDATE, DELETE) that serve as the persistence layer for applications, e.g. Amazon Aurora, MySQL, PostgreSQL. Typically a row-store architecture.
• OLAP (Online Analytical Processing)
Characterized by a relatively low volume of transactions; queries are often complex and involve aggregations against large historical datasets for data-driven decision making, e.g. Amazon Redshift, Greenplum. Typically a column-store architecture.
• Data Lake
An architectural paradigm that allows customers to store all of their data in a single unified place, where they can collect and store any data, at any scale, and at low cost. Data lakes complement (not replace) other data stores such as data warehouses, e.g. an S3 data lake.
[Diagram: user applications persist to OLTP stores (PostgreSQL, Amazon Aurora, business applications on Amazon EC2); ETL tools such as AWS Glue load the data lake and Amazon Redshift (OLAP); BI tools such as Amazon QuickSight serve user dashboards.]
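The row-store vs. column-store distinction above can be made concrete with a small sketch. This is plain Python over in-memory lists, purely illustrative (real engines add pages, compression, and indexes); the table and field names are hypothetical:

```python
# Row store: each record is stored contiguously -- good for OLTP, where you
# fetch or update one whole record at a time.
rows = [
    {"order_id": 1, "customer": "a", "amount": 120.0},
    {"order_id": 2, "customer": "b", "amount": 80.0},
    {"order_id": 3, "customer": "a", "amount": 45.5},
]

# Column store: one contiguous array per column -- good for OLAP, where an
# aggregation reads a single column across all records.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["a", "b", "a"],
    "amount": [120.0, 80.0, 45.5],
}

def get_order(order_id):
    """OLTP-style point lookup: one contiguous record in a row store."""
    return next(r for r in rows if r["order_id"] == order_id)

def total_revenue():
    """OLAP-style aggregation: touches only the 'amount' array, not full records."""
    return sum(columns["amount"])

print(get_order(2))     # reads a single record
print(total_revenue())  # reads a single column: 245.5
```

The payoff is in bytes touched: the aggregation never reads `order_id` or `customer`, which is why analytical engines like Redshift favor columnar layouts.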
18. Amazon S3 | AWS Glue: any analytic workload, any scale, at the lowest possible cost
Data movement (on-premises and real-time): AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service, AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
Machine learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
19. There are lots of ingestion tools
Data sources (transactions, web logs/cookies, ERP, connected devices) are ingested into Amazon S3, e.g. via S3 Transfer Acceleration, where the data is then processed and consumed.
20. Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics
22. Data preparation accounts for ~80% of the work
Where the time goes: building training sets; cleaning and organizing data; collecting data sets; mining data for patterns; refining algorithms; other.
23. Sample of steps required
- Find sources
- Create Amazon Simple Storage Service (Amazon S3) locations
- Configure access policies
- Map tables to Amazon S3 locations
- Run ETL jobs to load and clean data
- Create metadata access policies
- Configure access from analytics services
Rinse and repeat for other data sets, users, and end-services. And more:
- Manage and monitor ETL jobs
- Update the metadata catalog as data changes
- Update policies across services as users and permissions change
- Manually maintain cleansing scripts
- Create audit processes for compliance
- …
Manual | Error-prone | Time-consuming
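To make the steps above concrete: each one maps to an API call against AWS. Since live calls need credentials, this sketch only builds the request parameters you would pass to boto3 clients; the bucket, database, table, and column names are hypothetical:

```python
# Step "create S3 locations": parameters for boto3.client("s3").create_bucket(**bucket_params)
bucket_params = {"Bucket": "example-datalake-raw"}

# Step "map tables to S3 locations": parameters for
# boto3.client("glue").create_table(**table_params) in the Glue Data Catalog.
table_params = {
    "DatabaseName": "sales_db",
    "TableInput": {
        "Name": "orders",
        "StorageDescriptor": {
            "Location": "s3://example-datalake-raw/orders/",
            "Columns": [
                {"Name": "order_id", "Type": "bigint"},
                {"Name": "amount", "Type": "double"},
            ],
        },
    },
}

location = table_params["TableInput"]["StorageDescriptor"]["Location"]
print("table", table_params["TableInput"]["Name"], "->", location)
```

Multiply this by every data set, user, and consuming service (plus ETL jobs and access policies) and the "manual, error-prone, time-consuming" verdict follows; this is the work Lake Formation aims to automate.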
24. AWS Lake Formation: build a secure data lake in days
- Identify, ingest, clean, and transform data
- Enforce security policies across multiple services
- Gain and manage new insights
26. Register existing data or import new
Amazon S3 forms the storage layer for Lake Formation. Register existing S3 buckets that contain your data, or ask Lake Formation to create the required S3 buckets and import data into them. Data is stored in your account; you have direct access to it. No lock-in.
[Diagram: Lake Formation components — data lake storage, data catalog, access control, and data import (crawlers, ML-based data prep).]
27. Easily load data to your data lake
Blueprints load data from logs and databases into data lake storage, either one-shot or incrementally.
[Diagram: Lake Formation — data lake storage, data catalog, access control, and data import (crawlers, ML-based data prep).]
28. With blueprints
You:
1. Point us to the source
2. Tell us the location to load to in your data lake
3. Specify how often you want to load the data
Blueprints:
1. Discover the source table(s) schema
2. Automatically convert to the target data format
3. Automatically partition the data based on the partitioning schema
4. Keep track of data that was already processed
5. You can customize any of the above
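Two of the blueprint behaviors above, partitioning by a key and remembering what was already processed, can be sketched in a few lines. This is a simplified illustration, not Lake Formation's implementation; the table name and key layout are hypothetical:

```python
from datetime import date

processed = set()  # bookkeeping of already-loaded source objects

def target_key(table, record_date, filename):
    """Build a Hive-style partitioned S3 key, e.g. orders/dt=2019-01-01/file.parquet."""
    return f"{table}/dt={record_date.isoformat()}/{filename}"

def incremental_load(source_files):
    """Return only files not seen in previous runs, and mark them processed."""
    new = [f for f in source_files if f not in processed]
    processed.update(new)
    return new

print(target_key("orders", date(2019, 1, 1), "part-0000.parquet"))
# -> orders/dt=2019-01-01/part-0000.parquet
print(incremental_load(["a.csv", "b.csv"]))  # first run: both files are new
print(incremental_load(["b.csv", "c.csv"]))  # second run: only c.csv is new
```

Partitioning by date like this is what lets downstream engines (Athena, EMR, Redshift Spectrum) prune whole partitions instead of scanning everything.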
30. Secure once, access in multiple ways
[Diagram: an admin configures Lake Formation (data lake storage, data catalog, access control) once; multiple services then access the data through it.]
31. Security permissions in Lake Formation
- Control data access with simple grant and revoke permissions
- Specify permissions on tables and columns rather than on buckets and objects
- Easily view policies granted to a particular user
- Audit all data access in one place
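A table- and column-level grant looks roughly like this: the parameters you would pass to `boto3.client("lakeformation").grant_permissions(...)`, shown as a plain dict so the example runs without AWS. The account ID, role, database, table, and column names are hypothetical:

```python
# Grant an analyst role SELECT on two columns of one table -- no bucket or
# object policies involved.
grant_params = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/analyst"
    },
    "Resource": {
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "orders",
            "ColumnNames": ["order_id", "amount"],  # other columns stay hidden
        }
    },
    "Permissions": ["SELECT"],
}
# Revoking uses the same shape with revoke_permissions(...).
print(grant_params["Permissions"])
```

Compare this with writing the equivalent S3 bucket policies and IAM statements by hand: the grant is expressed in the catalog's own terms (database, table, columns), which is what makes "secure once, access in multiple ways" workable.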
34. Serverless Query Processing: Amazon Athena
• Serverless query service for querying data in S3 using standard SQL, with no infrastructure to manage
• No data loading required; query directly from Amazon S3
• Standard ANSI SQL with support for joins, JSON, and window functions
• Support for multiple data formats, including text, CSV, TSV, JSON, Avro, ORC, and Parquet
• Pay per query, only when you’re running queries, based on data scanned. If you compress your data, you pay less and your queries run faster
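The "pay per data scanned" model rewards compression, as a back-of-envelope calculation shows. This sketch assumes the commonly cited $5-per-TB-scanned rate (check current pricing) and an illustrative 5x compression ratio:

```python
PRICE_PER_TB = 5.00  # assumed Athena rate, USD per TB scanned

def query_cost(bytes_scanned):
    """Estimated cost of one query that scans the given number of bytes."""
    return PRICE_PER_TB * bytes_scanned / 1024**4

raw = 200 * 1024**3         # 200 GiB of uncompressed CSV
compressed = 40 * 1024**3   # the same data at ~5x compression

print(f"raw scan cost:        ${query_cost(raw):.4f}")
print(f"compressed scan cost: ${query_cost(compressed):.4f}")
# Columnar formats (Parquet/ORC) cut scanned bytes further, since a query
# reads only the columns it touches.
```

The same lever works twice: compression shrinks the bytes scanned, and columnar formats let the engine skip columns entirely, so the cheapest queries are also the fastest.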
35. Querying it in Amazon Athena
Either create a crawler to auto-generate the schema, or write a DDL on the Athena console, API, or JDBC/ODBC driver. Then start querying data.
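The write-a-DDL path might look like the statement below, held in a string you would submit via the Athena console, API, or JDBC/ODBC driver. The database, table, columns, and bucket are hypothetical examples:

```python
# Example Athena DDL: an external table over data already sitting in S3
# (no loading step), partitioned by date and stored as Parquet.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS sales_db.orders (
  order_id bigint,
  customer string,
  amount double
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-datalake-raw/orders/'
""".strip()

print(ddl)
```

Note the table only describes data in place; dropping it never deletes anything in S3, which is the point of "no data loading required."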
36. Relational data warehouse: Amazon Redshift
- Massively parallel; petabyte scale
- Fully managed
- HDD and SSD platforms
- $1,000/TB/year; starts at $0.25/hour
A lot faster, a lot simpler, a lot cheaper.
37. Amazon Redshift speed: three highlights
1. Machine-learning-based acceleration
2. Result-set caching for local and data lake queries (sub-second repeat queries); Redshift Spectrum caches intermediate results that can benefit different queries
3. Constant improvements in performance for real-world workloads
Amazon Redshift is now 10x faster than it was two years ago, from over 200 features and enhancements released due to lessons learned from more than 10,000 customer deployments processing over 2 exabytes of data every day.
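The result-set caching idea can be sketched conceptually: repeat queries return a cached result, and the cache invalidates itself when the underlying table changes. This is an illustration of the mechanism, not Redshift's implementation:

```python
cache = {}                    # (query text, table version) -> result
table_version = {"orders": 1}

def run_query(sql, table, compute):
    """Return a cached result for repeat queries; recompute after data changes."""
    key = (sql, table_version[table])  # a version bump makes old entries miss
    if key not in cache:
        cache[key] = compute()         # the expensive scan runs only once
    return cache[key]

calls = []  # counts how many real scans happened
result1 = run_query("SELECT sum(amount) FROM orders", "orders",
                    lambda: calls.append(1) or 245.5)
result2 = run_query("SELECT sum(amount) FROM orders", "orders",
                    lambda: calls.append(1) or 245.5)
print(result1, result2, "scans:", len(calls))  # second call is a cache hit

table_version["orders"] += 1  # data changed -> the next run recomputes
```

Keying the cache on a table version (rather than explicitly deleting entries) is a common design: stale results simply stop matching, which keeps invalidation cheap.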
38. Analyst recognition
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
The Forrester Wave™: Enterprise Data Warehouse, Q4 2015, and The Forrester Wave™: Big Data Warehouse, Q4 2018 — AWS rated top in the leader bracket and received a score of 5/5 (the highest score possible) in a number of areas such as Use Cases, Roadmap, Market Awareness, and Ability to Execute.
Gartner Magic Quadrant, 2018 — AWS positioned as a Leader in the Gartner Magic Quadrant for Data Management Solutions for Analytics.
39. Semi-structured/Unstructured Data Processing: Amazon EMR
• Hadoop, Hive, Presto, Spark, Tez, Impala, etc.
• Release 5.2: Hadoop 2.7.3, Hive 2.1, Spark 2.0.2, Zeppelin, Presto, HBase 1.2.3 and HBase on S3, Phoenix, Tez, Flink
• New applications added within 30 days of their open-source release
• Fully managed, auto-scaling clusters with support for on-demand and spot pricing
• Support for HDFS and S3 file systems, enabling separated compute and storage; multiple clusters can run against the same data in S3
• HIPAA-eligible; support for end-to-end encryption, IAM/VPC, and S3 client-side encryption with customer-managed keys and AWS KMS
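Launching such a cluster comes down to one API call. Below is a sketch of the parameters you would pass to `boto3.client("emr").run_job_flow(...)`, kept as a plain dict so it runs without AWS; the cluster name, instance types, and counts are hypothetical:

```python
# Mixed-market cluster: on-demand master for stability, spot core nodes for cost.
cluster_params = {
    "Name": "example-analytics-cluster",
    "ReleaseLabel": "emr-5.2.0",
    "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
    "Instances": {
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
             "InstanceType": "m4.large", "InstanceCount": 1},
            {"InstanceRole": "CORE", "Market": "SPOT",
             "InstanceType": "m4.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # transient cluster: terminate when done
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

markets = {g["Market"] for g in cluster_params["Instances"]["InstanceGroups"]}
print(sorted(markets))
```

Because the data lives in S3 rather than only in HDFS, several transient clusters like this can run concurrently against the same data and then terminate, which is the separated compute-and-storage model the slide describes.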
44. Summary
• Data MUST be used in every organization
• Data lakes are essential for consuming structured and unstructured data
• Govern the data lake
• Analyze data with the right tool
• Support different types of data consumers