Building a modern data platform in the cloud. AWS DevDay Nordics
1. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
N O R D I C S
04.03.19
2.
N O R D I C S
04.02.19
Building a Modern Data Platform in
the Cloud
Javier Ramirez
AWS Tech Evangelist
@supercoco9
D A T 1
3.
Traditionally, analytics used to feel like this
OLTP, ERP, CRM, LOB
Data Warehouse
Business Intelligence
• Very rigid
• Limited to some structured data
• Quite hard
• Slow (days/weeks/months)
• Incomplete
• Hard to scale (closed source, closed documentation, vertical scaling)
4.
Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey saw organizations that implemented a Data Lake outperforming similar companies by 9% in organic revenue growth.*
Organic revenue growth: Leaders 24%, Followers 15%
*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
To Become a Leader, Data is Your Differentiator
5.
Before 2009: The DBA years
Problems: My reports make my database server very slow
Solution: Overnight DB dump; read-only replica
2009-2011: The Hadoop epiphany
Problems: My data doesn’t fit in one machine; and it’s not only transactional
Solution: Hadoop; Map/Reduce all the things
2012-2014: The Message Broker and NoSQL Age
Problems: My data is very fast; Map/Reduce is hard to use
Solution: Kafka/RabbitMQ; Cassandra/HBase/Storm; basic ETL; Hive
2015-2017: The Spark kingdom and the spreadsheet wars
Problems: Duplicating batch/stream is inefficient; I need to cleanse my source data; Hadoop ecosystem is hard to manage; my data scientists don’t like Java; I am not sure which data we are already processing
Solution: Kafka/Spark; complex ETL; create new departments for data governance; spreadsheet all the things
2017-2018: The myth of DataOps
Problems: Streaming is hard; my schemas have evolved; I cannot query old and new data together; my cluster is running old versions and upgrading is hard; I want to use ML
Solution: Kafka/Flink (Java or Scala required); complex ETL with a pinch of ML; Apache Atlas; commercial distributions
6.
Some problems during all periods
Main problems
• My team spends more time maintaining the cluster than adding functionality
• Security and monitoring are hard
• Most of the time my cluster sits idle; then it becomes a bottleneck
• I don’t have the time to experiment
• Data preparation, cleansing, and basic transformations take a disproportionately high amount of my time. And it’s so frustrating
7.
Some things that scare me
• Text encodings
• Empty strings. Literal "NULL" strings
• Uppercase and lowercase
• Date and time formats: which date would you say 1/4/19 is? And 1553589297?
• CSV, especially if uploaded by end users
• JSON files with a single array and 200,000 records inside
• The same JSON file when row 176,543 has a column never seen before
• The same JSON file when all the numbers are strings
• XML
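Several of the pitfalls above can be reproduced in a few lines. The sketch below is only an illustration: the helper names (`parse_ambiguous`, `from_epoch`, `clean_value`) and the set of null markers are assumptions made for this example, not something shown in the talk.

```python
from datetime import datetime, timezone

# Hypothetical spellings that this sketch treats as "missing value".
NULL_MARKERS = {"", "null", "none", "n/a"}


def parse_ambiguous(text):
    """Return both readings of a slash-separated date: day-first and month-first."""
    day_first = datetime.strptime(text, "%d/%m/%y").date()
    month_first = datetime.strptime(text, "%m/%d/%y").date()
    return day_first, month_first


def from_epoch(seconds):
    """Read an integer as seconds since the Unix epoch, always in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def clean_value(raw):
    """Map null-like strings to None and numeric strings to numbers."""
    if not isinstance(raw, str):
        return raw
    text = raw.strip()
    if text.lower() in NULL_MARKERS:
        return None
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text


if __name__ == "__main__":
    print(parse_ambiguous("1/4/19"))
    print(from_epoch(1553589297).isoformat())
    print([clean_value(v) for v in ["NULL", " 42 ", "3.5", "abc"]])
```

Under these assumptions, 1/4/19 reads as either 1 April 2019 or 4 January 2019, and 1553589297 is 2019-03-26 08:34:57 UTC. Which reading is right depends on who produced the file, so the source locale has to be recorded alongside the data.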
8.
The downfall of the data engineer
“Watching paint dry is exciting in comparison to writing and maintaining Extract Transform and Load (ETL) logic. Most ETL jobs take a long time to execute and errors or issues tend to happen at runtime or are post-runtime assertions. Since the development time to execution time ratio is typically low, being productive means juggling with multiple pipelines at once and inherently doing a lot of context switching. By the time one of your 5 running “big data jobs” has finished, you have to get back in the mind space you were in many hours ago and craft your next iteration. Depending on how caffeinated you are, how long it’s been since the last iteration, and how systematic you are, you may fail at restoring the full context in your short term memory. This leads to systemic, stupid errors that waste hours.”
Maxime Beauchemin, Data engineer extraordinaire at Lyft, creator of Apache Airflow and Apache Superset. Ex-Facebook, Ex-Yahoo!, Ex-Airbnb
https://medium.com/@maximebeauchemin/the-downfall-of-the-data-engineer-5bfb701e5d6b
9.
Solution
10.
More data lakes & analytics on AWS than anywhere else
11.
A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale
12.
Data Lakes, Analytics, and ML Portfolio from AWS
Broadest, deepest set of analytic services
Analytics: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch Service, Amazon Kinesis, Amazon QuickSight
Machine Learning: Amazon SageMaker, AWS Deep Learning AMIs, Amazon Rekognition, Amazon Lex, AWS DeepLens, Amazon Comprehend, Amazon Translate, Amazon Transcribe, Amazon Polly
On-premises Data Movement: AWS Direct Connect, AWS Snowball, AWS Snowmobile, AWS Database Migration Service, AWS Storage Gateway
Real-time Data Movement: AWS IoT Core, Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Kinesis Video Streams
Data Lake on AWS: Storage | Archival Storage | Data Catalog
13.
Data Movement From On-premises Datacenters
AWS Snowball, Snowball Edge and Snowmobile
Petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud
AWS Direct Connect
Establish a dedicated network connection from your premises to AWS; this can reduce your network costs, increase bandwidth throughput, and provide a more consistent network experience than Internet-based connections
AWS Storage Gateway
Lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache
AWS Database Migration Service
Migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications
14.
15.
Data Movement From Real-time Sources
Amazon Kinesis Video Streams
Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
Amazon Kinesis Data Firehose
Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
Amazon Kinesis Data Streams
Build custom, real-time applications that process data streams using popular stream processing frameworks
AWS IoT Core
Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
Managed Streaming for Kafka
Fully managed open-source platform for building real-time streaming data pipelines and applications
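As an illustration of what a producer for Kinesis Data Streams might look like, here is a minimal sketch. It assumes a boto3-style client is passed in (for example `boto3.client("kinesis")`), and the stream name `telemetry` and the `player_id` field are hypothetical. The client is injected so the serialization logic can be exercised without an AWS account.

```python
import json


def build_record(event, partition_field="player_id"):
    """Serialize an event into the Data and PartitionKey arguments of PutRecord.

    partition_field is a hypothetical field name; records with the same
    partition key are routed to the same shard.
    """
    return {
        "Data": (json.dumps(event, sort_keys=True) + "\n").encode("utf-8"),
        "PartitionKey": str(event[partition_field]),
    }


def send_event(client, stream_name, event):
    """Send one event through any object exposing a boto3-style put_record."""
    return client.put_record(StreamName=stream_name, **build_record(event))


# Usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   send_event(boto3.client("kinesis"), "telemetry", {"player_id": 7, "score": 10})
```

Choosing the partition key is the main design decision here: a key with few distinct values concentrates traffic on a few shards.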
16.
Amazon S3—Object Storage
Security and Compliance
Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
Flexible Management
Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
Durability, Availability & Scalability
Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be replicated automatically to any other AWS region
Query in Place
Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
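The lifecycle policies mentioned above are plain configuration documents. The following sketch builds one rule that tiers objects to Glacier and later expires them; the prefix, the day counts, and the helper names are assumptions for illustration. The resulting dictionary has the shape expected by the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`.

```python
def lifecycle_rule(prefix, archive_after_days, expire_after_days):
    """Build one S3 lifecycle rule: tier to Glacier, then expire."""
    if expire_after_days <= archive_after_days:
        raise ValueError("objects must be archived before they expire")
    rule_id = "archive-" + (prefix.strip("/").replace("/", "-") or "all")
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": archive_after_days, "StorageClass": "GLACIER"}],
        "Expiration": {"Days": expire_after_days},
    }


def lifecycle_configuration(rules):
    """Wrap rules in the top-level document S3 expects."""
    return {"Rules": list(rules)}


# Usage with a real client (requires boto3 and AWS credentials; the bucket
# name is hypothetical):
#   import boto3
#   config = lifecycle_configuration([lifecycle_rule("raw/logs/", 90, 365)])
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=config)
```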
17.
Unmatched Durability and Availability
Scalable and durable
• Designed to deliver 99.999999999% durability
• Geographic redundancy & automatic replication
• Store data in multiple data centers across 3 AZs in a single region
• Seamlessly replicates data between any two regions (but don’t run analytics across regions: latency and cost will suffer)
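To get a feeling for what eleven nines means, the durability figure can be turned into an expected loss rate. This sketch assumes the figure is read as a per-object probability of surviving one year, and the object count is an arbitrary example. Exact fractions are used because 1 minus 0.99999999999 is not exact in floating point.

```python
from fractions import Fraction

# 99.999999999% expressed exactly, read here as annual per-object durability.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count, durability=DURABILITY):
    """Expected number of objects lost per year under the stated assumption."""
    return object_count * (1 - durability)


def years_per_lost_object(object_count, durability=DURABILITY):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count, durability)


if __name__ == "__main__":
    # Example: ten million stored objects.
    print(years_per_lost_object(10_000_000))
```

With ten million objects, this works out to one expected object loss every 10,000 years.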
18.
Any Scale
Scalable and durable
• S3 has trillions of objects and exabytes of data
• Built to store any amount of data
• Runs on the world’s largest global
cloud infrastructure
19.
Amazon Glacier—Backup and Archive
Durability, Availability & Scalability
Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be replicated automatically to any other AWS region
Secure
Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements
Retrieves data in minutes
Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
Inexpensive
Lowest-cost AWS object storage class, allowing you to archive large amounts of data at a very low cost
20.
Data Preparation Accounts for ~80% of the Work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other
21.
Storing is Not Enough, Data Needs to Be Discoverable
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data
[Diagram: data sources: CRM, ERP, data warehouse, mainframe data, web, social, log files, machine data, semi-structured, unstructured]
22.
AWS Glue—Data Catalog
Make data discoverable
• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient
Glue
Data Catalog
Discover data and
extract schema
Compliance
23.
AWS Glue Crawlers
Automatically catalog your data
Crawlers automatically build your Data Catalog and keep it in sync.
Automatically discover new data and extract schema definitions
Detect schema changes and version tables
Detect Hive-style partitions on Amazon S3
Built-in classifiers for popular types; custom classifiers using Grok expressions
Run ad hoc or on a schedule; serverless: only pay when the crawler runs
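Hive-style partitions encode column values in the object key as `name=value` path segments. The sketch below shows how such a key can be decoded; it is an illustration of the naming convention only, not a description of how Glue crawlers are implemented, and the example key is hypothetical.

```python
def parse_hive_partitions(key):
    """Extract name=value partition segments from an object key.

    The last path segment is treated as the file name and ignored.
    """
    partitions = {}
    for segment in key.split("/")[:-1]:
        name, separator, value = segment.partition("=")
        if separator and name and value:
            partitions[name] = value
    return partitions


if __name__ == "__main__":
    # Hypothetical key layout for a table partitioned by year and month.
    print(parse_hive_partitions("sales/year=2019/month=03/part-0000.parquet"))
```

Partition values come back as strings ("03", not 3), which is one reason a catalog also needs to record a type for each partition column.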
24.
AWS Glue—ETL Service
Make ETL scripting and deployment easy
• Automatically generates ETL code: Spark (Scala/Python) or Python shell script
• Code is customizable (demo later on. Yay!)
• Endpoints provided to edit, debug, and test code
• Jobs are scheduled or event-based
• Serverless
25.
Data Lakes, Analytics, and ML Portfolio from AWS
Broadest, deepest set of analytic services
26.
Amazon EMR—Big Data Processing
Low cost
Flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto-scaling to reduce costs 50–80%
Easy
Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning
Latest versions
Updated with the latest open source frameworks within 30 days of release
Use S3 storage
Process data directly in the S3 data lake securely with high performance using the EMRFS connector
27.
28.
Amazon EMR— More than just managed Hadoop
29.
Amazon Redshift—Data Warehousing
Fast at scale
Columnar storage technology to improve I/O efficiency and scale query performance
Secure
Audit everything; encrypt data end-to-end; extensive certification and compliance
Open file formats
Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Inexpensive
As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
30.
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in S3 data lake
[Diagram: Redshift data and the S3 data lake, both reached through the Redshift Spectrum query engine]
• Exabyte-scale Redshift SQL queries against S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Avro, & Parquet data formats
• Pay only for the amount of data scanned
31.
Numbers are fun
Werner Vogels, Amazon’s CTO, AWS Summit San Francisco 2017
https://youtu.be/RpPf38L0HHU?t=3963
34.
Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Query Instantly
Zero setup cost; just point to S3 and start querying
Open
ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy
Serverless: zero infrastructure, zero administration; integrated with QuickSight
Pay per query
Pay only for queries run; save 30–90% on per-query costs through compression
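Because billing is per byte scanned, the saving from compression follows directly from the compression ratio. The arithmetic can be sketched as below. The price passed in is a placeholder supplied by the caller, not a quoted AWS price, and treating a terabyte as 1024^4 bytes is an assumption of this example.

```python
BYTES_PER_TB = 1024 ** 4  # assumption: binary terabytes


def scan_cost(bytes_scanned, price_per_tb):
    """Cost of a query that scans the given number of bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


def savings_fraction(raw_bytes, compressed_bytes):
    """Fraction of the per-query cost saved by scanning compressed data."""
    return 1 - compressed_bytes / raw_bytes


if __name__ == "__main__":
    price = 5.0  # placeholder price per TB; check current pricing
    raw = 2 * BYTES_PER_TB
    compressed = raw // 4  # hypothetical 4:1 compression ratio
    print(scan_cost(raw, price), scan_cost(compressed, price))
    print(savings_fraction(raw, compressed))
```

A 4:1 ratio cuts the scanned bytes, and so the cost, by 75%, which sits inside the 30–90% range quoted on the slide. Columnar formats can reduce the scan further because only the referenced columns are read.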
35.
Amazon Kinesis—Real Time
Kinesis Data Firehose
Load data streams into AWS data stores
Kinesis Data Streams
Build custom applications that analyze data streams
Kinesis Video Streams
Capture, process, and store video streams for analytics
Kinesis Data Analytics
Analyze data streams with SQL
36.
Example - Real-time Log Analytics With SQL
37.
Amazon QuickSight
• Easy
• Empower everyone
• Seamless connectivity
• Fast analysis
• Serverless
Now with ML superpowers!
38.
Data Lakes, Analytics, and ML Portfolio from AWS
Broadest, deepest set of analytic services
39.
Data Lakes from AWS
Data Lake
on AWS
Cost-effective
Scalable and durable
Secure
Open and comprehensive
Analytics
Machine Learning
Real-time Data
Movement
On-premises
Data Movement
40.
AWS Provides Highest Levels of Security
Secure
Compliance
AWS Artifact
Amazon Inspector
AWS CloudHSM
Amazon Cognito
AWS CloudTrail
Security
Amazon GuardDuty
AWS Shield
AWS WAF
Amazon Macie
VPC
Encryption
AWS Certificate Manager
AWS Key Management
Service
Encryption at rest
Encryption in transit
Bring your own keys, HSM
support
Identity
AWS IAM
AWS SSO
Amazon Cloud Directory
AWS Directory Service
AWS Organizations
Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake
41.
Compliance: Virtually Every Regulatory Agency
Global
CSA: Cloud Security Alliance Controls
ISO 9001: Global Quality Standard
ISO 27001: Security Management Controls
ISO 27017: Cloud Specific Controls
ISO 27018: Personal Data Protection
PCI DSS Level 1: Payment Card Standards
SOC 1: Audit Controls Report
SOC 2: Security, Availability, & Confidentiality Report
SOC 3: General Controls Report
United States
CJIS: Criminal Justice Information Services
DoD SRG: DoD Data Processing
FedRAMP: Government Data Standards
FERPA: Educational Privacy Act
FIPS: Government Security Standards
FISMA: Federal Information Security Management
GxP: Quality Guidelines and Regulations
FFIEC: Financial Institutions Regulation
HIPAA: Protected Health Information
ITAR: International Arms Regulations
MPAA: Protected Media Content
NIST: National Institute of Standards and Technology
SEC Rule 17a-4(f): Financial Data Standards
VPAT/Section 508: Accountability Standards
Asia Pacific
FISC [Japan]: Financial Industry Information Systems
IRAP [Australia]: Australian Security Standards
K-ISMS [Korea]: Korean Information Security
MTCS Tier 3 [Singapore]: Multi-Tier Cloud Security Standard
My Number Act [Japan]: Personal Information Protection
Europe
C5 [Germany]: Operational Security Attestation
Cyber Essentials Plus [UK]: Cyber Threat Protection
G-Cloud [UK]: UK Government Standards
IT-Grundschutz [Germany]: Baseline Protection Methodology
42.
Data Lakes from AWS
Data Lake
on AWS
Cost-effective
Scalable and durable
Secure
Open and comprehensive
Analytics
Machine Learning
Real-time Data
Movement
On-premises
Data Movement
43.
AWS Runs the Largest Global Cloud Infrastructure
Scalable and durable
For example: Amazon S3 holds trillions of objects and regularly peaks at millions of requests per second
[Chart: customer data over time]
“…the scale at which AWS operates its public cloud storage services dwarfs the other vendors in this Magic Quadrant.”
- Gartner Magic Quadrant for Public Cloud Storage Services, Worldwide, Raj Bala, Arun Chandrasekaran, John McArthur, July 24, 2017
44.
Data Lakes from AWS
Data Lake
on AWS
Lowest cost
Scalable and durable
Secure
Open and comprehensive
Analytics
Machine Learning
Real-time Data
Movement
On-premises
Data Movement
45.
Pay Only for the Resources You Use as you Scale
Lowest Cost
• Pay-as-you-go for the resources you consume
• As low as $0.05/GB scanned with Athena
• EMR and Athena can automatically scale down resources after a job completes, saving you costs
• Commit to a set term and save up to 75% with Reserved Instances
• Run on spare compute capacity with EMR and save up to 90% with Spot
Traditional approach (rigid): leads to wasted capacity
[Chart: fixed server capacity against demand; unmet demand means upset players and missed revenue, excess capacity means wasted $$$]
AWS approach (elastic): pay for the capacity you use
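The difference between the rigid and elastic approaches can be made concrete with a small calculation. The demand series and the capacity value below are made-up numbers in arbitrary units, chosen only to illustrate the chart; the function names are likewise hypothetical.

```python
def fixed_capacity_outcome(demand, capacity):
    """Return (unmet_demand, wasted_capacity) for a fixed-size cluster."""
    unmet = sum(max(d - capacity, 0) for d in demand)
    wasted = sum(max(capacity - d, 0) for d in demand)
    return unmet, wasted


def elastic_capacity_used(demand):
    """With perfectly elastic capacity, usage simply tracks demand."""
    return sum(demand)


if __name__ == "__main__":
    demand = [2, 4, 10, 3]  # hypothetical demand per time slot
    capacity = 6            # hypothetical fixed cluster size
    print(fixed_capacity_outcome(demand, capacity))
    print(elastic_capacity_used(demand), "vs", capacity * len(demand))
```

In this example the fixed cluster leaves 4 units of demand unmet during the peak while wasting 9 units in the quiet slots. It pays for 24 units in total, whereas the elastic case uses 19 and serves all demand. Real autoscaling is not perfectly elastic, so this is an upper bound on the benefit.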
46.
AWS databases and analytics
Broad and deep portfolio, built for builders
Analytics: Amazon Redshift (data warehousing), Amazon EMR (Hadoop + Spark), Athena (interactive analytics), Kinesis Analytics (real-time), Amazon Elasticsearch Service (operational analytics)
Databases: RDS (MySQL, PostgreSQL, MariaDB, Oracle, SQL Server), Aurora (MySQL, PostgreSQL), DynamoDB (key value, document), ElastiCache (Redis, Memcached), Neptune (graph), Timestream (time series), QLDB (ledger database), RDS on VMware
Business Intelligence & Machine Learning: Amazon QuickSight, Amazon SageMaker, Amazon Comprehend, Amazon Rekognition, Amazon Lex, Amazon Transcribe, AWS DeepLens
Data Lake: S3/Amazon Glacier, AWS Glue (ETL & Data Catalog), Lake Formation (data lakes)
Data Movement: Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect
Blockchain: Managed Blockchain, Blockchain Templates
AWS Marketplace: 250+ solutions, 730+ Database solutions, 600+ Analytics solutions, 25+ Blockchain solutions, 20+ Data lake solutions, 30+ solutions
47. CHALLENGE
Need to create a constant feedback loop for designers
Gain an up-to-the-minute understanding of gamer satisfaction to guarantee gamers are engaged, resulting in the most popular game played in the world
Fortnite | 125+ million players
48.
Epic Games uses Data Lakes and analytics
Entire analytics platform running on AWS
S3 leveraged as a Data Lake
All telemetry data is collected with Kinesis
Real-time analytics done through Spark on EMR,
DynamoDB to create scoreboards and real-time queries
Use Amazon EMR for large batch data processing
Game designers use data to inform their decisions
[Architecture diagram. Sources: game clients, game servers, launcher, game services, APIs, databases, S3, other sources. Ingestion: Kinesis. Near-real-time pipelines: Spark on EMR, DynamoDB, Grafana, Scoreboards API, limited raw data (real-time ad hoc SQL), user ETL (metric definition). Batch pipelines: S3 (data lake), ETL using EMR, Tableau/BI, ad hoc SQL.]
49.
50.
Demo Overview
https://aws.amazon.com/blogs/big-data/harmonize-query-and-visualize-data-from-various-providers-using-aws-glue-amazon-athena-and-amazon-quicksight/
70.
71.
Typical steps of building a data lake
Step 1: Set up storage
Step 2: Move data
Step 3: Cleanse, prep, and catalog data
Step 4: Configure and enforce security and compliance policies
Step 5: Make data available for analytics
72.
Building data lakes can still take months
73.
AWS Lake Formation (join the preview)
Build, secure, and manage a data lake in days
Build a data lake in days, not months
Build and deploy a fully managed data lake with a few clicks
Enforce security policies across multiple services
Centrally define security, governance, and auditing policies in one place and enforce those policies for all users and all applications
Combine different analytics approaches
Empower analyst and data scientist productivity, giving them self-service discovery and safe access to all data from a single catalog
74.
How it works: AWS Lake Formation
[Diagram: sources (OLTP, ERP, CRM, LOB, devices, web, sensors, social, Kinesis); data lake storage on S3 with IAM and KMS; Data Catalog; consumers (Athena, Amazon Redshift, AI services, Amazon EMR, Amazon QuickSight)]
Build Data Lakes quickly
• Identify, crawl, and catalog sources
• Ingest and clean data
• Transform into optimal formats
Simplify security management
• Enforce encryption
• Define access policies
• Implement audit logging
Enable self-service and combined analytics
• Analysts discover all data available for analysis from a single data catalog
• Use multiple analytics tools over the same data
75.
Customer interest in AWS Lake Formation
“We are very excited about the launch of AWS Lake
Formation, which provides a central point of control to
easily load, clean, secure, and catalog data from thousands of
clients to our AWS-based data lake, dramatically reducing
our operational load. … Additionally, AWS Lake Formation
will be HIPAA compliant from day one …”
- Aaron Symanski, CTO, Change Healthcare
“I can’t wait for my team to get our hands on AWS Lake
Formation. With an enterprise-ready option like Lake
Formation, we will be able to spend more time deriving
value from our data rather than doing the heavy lifting
involved in manually setting up and managing our data lake.”
- Joshua Couch, VP Engineering, Fender Digital
76. Thank you!
Javier Ramirez
@supercoco9
77.
Select AWS Glue customers