© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved
Securing Your Big Data on AWS
Hannah Marlowe, PhD
Professional Services Consultant
What to expect from this session?
• Current data challenges
• Data Lake approach
• Encryption in transit and at rest
• Components of a Data Lake
• Security best practices
• Storage
• Metadata/Catalog
• Compute
• Customer examples
Data challenge today
Quickly ingest and store
any type of data, at any
scale, and at low cost
Have a single source of
truth and quickly search
and find the relevant
data
Easily query the data
through a unified set of
tools
For Data to Be a Differentiator, Customers Need
to Be Able to…
• Capture and store new non-relational
data at PB-EB scale in real time
• New types of analytics that go beyond
batch reporting to incorporate real-time,
predictive, voice, and image recognition
• Democratize access to data in a secure
and governed way
New types of data
New types of analytics
Dashboards, Predictive, Image Recognition, Voice, Real-time
Traditionally, Analytics Used to Look Like This
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence • Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/Year
Data Lakes Extend the Traditional Approach
Data Warehouse
Business Intelligence
OLTP ERP CRM LOB
• Relational and non-relational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
• Help locate, curate, and secure your data
• Provide democratized access to data
within your organization
Devices Web Sensors Social
Big Data processing,
real-time, Machine Learning
Data Lake
Building a Data Lake on AWS
Athena
Query Service
Batch Glue IoT Lambda SageMaker
The primary components of a Data Lake
Architectural Layers of a
Data Lake (without security)
Storage
Compute
Metadata/Catalog
• Object storage – S3/Glacier
• Block storage - EBS
• File storage - EFS
• Attached instance store
• EC2 instance
• Redshift clusters
The primary components of a Data Lake
• Automatically Index data
• Easy search with
tags/business domain
• Curate and assign
relevancy score
• Easily commission and
decommission data sets
• Capture data lineage
Architectural Layers of a
Data Lake (without security)
Storage
Compute
Metadata/Catalog
The primary components of a Data Lake
Architectural Layers of a
Data Lake (without security)
• Server-based compute
• More than just standalone
EC2, also includes EMR,
Redshift
• Serverless compute (Lambda,
Athena, API Gateway, etc.)
• Hybrid
• Redshift Spectrum
Storage
Compute
Metadata/Catalog
Ubiquitous Encryption
Encrypt data in transit
Protect data flows
Point “A” -> Point “B”: Data flow protection
• Enterprise data sources -> Amazon S3: Encrypted with SSL/TLS; S3 requests signed with AWS SigV4
• Amazon S3 -> Amazon EMR: Encrypted with SSL/TLS
• Amazon S3 -> Amazon Redshift: Encrypted with SSL/TLS
• Amazon EMR -> Clients: Encrypted with SSL/TLS; varies with Hadoop application client
• Amazon Redshift -> Clients: Supports SSL/TLS; requires configuration
Apache Hadoop on Amazon EMR
• Hadoop RPC encryption
• HDFS Block data transfer encryption
• KMS over HTTPS is not enabled by default with Hadoop KMS
• May vary with EMR release (such as Tez and Spark in release 5.0.0+)
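The data flow list above notes that S3 requests are signed with AWS SigV4. As a rough illustration of what that signing involves, the sketch below derives a SigV4 signing key with chained HMAC-SHA256 steps and signs a string with it. The function names and the example secret are hypothetical; building the canonical request and string to sign is left out, and in practice an AWS SDK handles all of this for you.

```python
import hashlib
import hmac


def _hmac_sha256(key: bytes, message: str) -> bytes:
    """One HMAC-SHA256 step of the key derivation chain."""
    return hmac.new(key, message.encode("utf-8"), hashlib.sha256).digest()


def derive_sigv4_signing_key(secret_key: str, date_stamp: str,
                             region: str, service: str) -> bytes:
    """Derive a signing key scoped to one day, Region, and service.

    date_stamp is in YYYYMMDD form. The secret access key itself is
    never sent over the wire; only signatures made with the derived key are.
    """
    k_date = _hmac_sha256(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac_sha256(k_date, region)
    k_service = _hmac_sha256(k_region, service)
    return _hmac_sha256(k_service, "aws4_request")


def sign_string(signing_key: bytes, string_to_sign: str) -> str:
    """Return the hex signature for an already-built string to sign."""
    return hmac.new(signing_key, string_to_sign.encode("utf-8"),
                    hashlib.sha256).hexdigest()


# Hypothetical inputs, for illustration only.
example_key = derive_sigv4_signing_key("EXAMPLE-SECRET", "20170101", "us-east-1", "s3")
example_signature = sign_string(example_key, "example-string-to-sign")
```

Because the date, Region, and service are mixed into the key, a captured signature cannot be replayed against another service or on another day.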
Options for data-at-rest encryption in AWS
Client-side encryption
• You encrypt your data before submitting to an AWS service
• You supply encryption keys OR use keys in AWS Key Management Service under your
control
• Tools: AWS Encryption SDK, S3 Encryption Client, EMRFS Client, DynamoDB Encryption
Client
Server-side encryption
• AWS encrypts data on your behalf after it is received by the service
• 47 services including Amazon S3, Amazon EBS, Amazon RDS, Amazon Redshift,
Amazon WorkSpaces, Amazon Kinesis Streams, AWS CloudTrail…
• Integrated with AWS Key Management Service so that you control the keys
Server-Side Encryption in AWS
Two-tiered key hierarchy using envelope encryption
• Unique data key encrypts customer data
• Customer Master Keys encrypt data keys
Benefits
• Limits risk of compromised data key
• Better performance for encrypting large data
• Easier to manage small number of master keys
than billions of data keys
• Centralized access and audit of key activity
Customer master
keys
Data key 1
S3 object EBS volume Amazon
Redshift cluster
Data key 2 Data key 3 Data key 4
Custom
application
KMS
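The two-tiered key hierarchy on this slide can be sketched in a few lines. The example below is only a structural illustration: ToyKms is a hypothetical in-memory stand-in for a key service, and the cipher is a toy HMAC-based keystream chosen so the sketch runs on the standard library alone. It is not AWS KMS and not suitable for protecting real data; a real implementation would use an authenticated cipher such as AES-GCM.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Tuple


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy cipher: XOR data with an HMAC-SHA256 counter keystream.

    Stand-in for a real cipher; NOT for production use.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hmac.new(key, nonce + counter, hashlib.sha256).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


class ToyKms:
    """Hypothetical key service; the master key never leaves this object."""

    def __init__(self) -> None:
        self._master_key = secrets.token_bytes(32)

    def generate_data_key(self) -> Tuple[bytes, bytes]:
        """Return (plaintext data key, data key wrapped by the master key)."""
        data_key = secrets.token_bytes(32)
        nonce = secrets.token_bytes(16)
        wrapped = nonce + _keystream_xor(self._master_key, nonce, data_key)
        return data_key, wrapped

    def decrypt_data_key(self, wrapped: bytes) -> bytes:
        nonce, body = wrapped[:16], wrapped[16:]
        return _keystream_xor(self._master_key, nonce, body)


def envelope_encrypt(kms: ToyKms, plaintext: bytes) -> Dict[str, bytes]:
    """Encrypt with a unique data key and keep only its wrapped form."""
    data_key, wrapped_key = kms.generate_data_key()
    nonce = secrets.token_bytes(16)
    ciphertext = _keystream_xor(data_key, nonce, plaintext)
    return {"wrapped_key": wrapped_key, "nonce": nonce, "ciphertext": ciphertext}


def envelope_decrypt(kms: ToyKms, envelope: Dict[str, bytes]) -> bytes:
    """Unwrap the data key via the key service, then decrypt the data."""
    data_key = kms.decrypt_data_key(envelope["wrapped_key"])
    return _keystream_xor(data_key, envelope["nonce"], envelope["ciphertext"])
```

Each object gets its own data key, so the bulk data never has to pass through the key service, and access to the data can be audited or revoked at the point where the wrapped key is decrypted.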
Storage
Compute
Metadata/Catalog
Storage
• Scalable, highly available storage
• Designed for eleven 9s (99.999999999%) of durability
• Central Data Lake source of truth
• Store anything at any scale
• Query in place with Athena and Redshift
Spectrum
Amazon S3
Secure S3 – preemptive controls
• Create buckets based on business domains
• Assign bucket policies
– Restrict by VPC, SSL, IP filters, KMS keys
• Restrict using Tags
• Enable encryption
• Enable versioning
• MFA delete
• Enable backups – across accounts/regions
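As one possible reading of the "assign bucket policies" and "enable encryption" bullets, the sketch below builds a bucket policy that denies requests made without SSL/TLS and denies uploads that do not ask for SSE-KMS. The helper name and statement IDs are hypothetical; the condition keys aws:SecureTransport and s3:x-amz-server-side-encryption are standard, but the exact restrictions you need (VPC, IP ranges, specific KMS keys) depend on your environment.

```python
import json


def build_bucket_policy(bucket_name: str) -> dict:
    """Build a restrictive bucket policy for one bucket (illustrative only)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Reject any request that did not arrive over SSL/TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Reject uploads that do not request SSE-KMS encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }


policy_json = json.dumps(build_bucket_policy("EXAMPLE-BUCKET-NAME"), indent=2)
```

The resulting JSON string could be pasted into the bucket policy editor or passed to an SDK call; explicit Deny statements take precedence over any Allow granted elsewhere.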
Manage permissions with tags
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": "arn:aws:s3:::EXAMPLE-BUCKET-NAME/*",
      "Condition": {"StringEquals": {"s3:ExistingObjectTag/HIPAA": "True"}}
    }
  ]
}
Secure S3 data – detective controls
• Enable AWS Config to detect S3 bucket-level changes
– s3-blacklisted-actions-prohibited, s3-bucket-policy-not-more-permissive,
s3-bucket-logging-enabled, s3-bucket-public-read-prohibited,
s3-bucket-public-write-prohibited, s3-bucket-replication-enabled,
s3-bucket-server-side-encryption-enabled, s3-bucket-ssl-requests-only,
s3-bucket-versioning-enabled
• S3 data access audit using CloudTrail – log to separate CloudWatch Logs
– Kerberos-enabled EMR clusters allow you to track AD users
• Use Amazon GuardDuty to detect unauthorized and unexpected activity
• Enable Amazon Macie to classify sensitive data
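To make the AWS Config bullet more concrete, the sketch below builds request payloads for a few of the managed rules named on this slide, in the shape a put_config_rule call expects. The helper name is hypothetical, and the managed rule identifiers in the mapping are written from memory, so check them against the AWS Config managed rules list before use. Only a subset of the slide's rules is included.

```python
# Rule name (as shown on the slide) -> managed rule identifier.
# Assumption: identifiers should be verified against the AWS Config docs.
MANAGED_S3_RULES = {
    "s3-bucket-versioning-enabled": "S3_BUCKET_VERSIONING_ENABLED",
    "s3-bucket-ssl-requests-only": "S3_BUCKET_SSL_REQUESTS_ONLY",
    "s3-bucket-public-read-prohibited": "S3_BUCKET_PUBLIC_READ_PROHIBITED",
    "s3-bucket-public-write-prohibited": "S3_BUCKET_PUBLIC_WRITE_PROHIBITED",
    "s3-bucket-server-side-encryption-enabled":
        "S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED",
}


def build_config_rule(rule_name: str) -> dict:
    """Build the ConfigRule payload for one AWS-managed S3 rule."""
    return {
        "ConfigRuleName": rule_name,
        "Source": {
            "Owner": "AWS",  # AWS-managed rule, not a custom Lambda rule
            "SourceIdentifier": MANAGED_S3_RULES[rule_name],
        },
        # Evaluate only S3 buckets.
        "Scope": {"ComplianceResourceTypes": ["AWS::S3::Bucket"]},
    }


config_rules = [build_config_rule(name) for name in sorted(MANAGED_S3_RULES)]
```

With an SDK such as boto3, each payload would be passed as the ConfigRule argument of put_config_rule; that call is not made here so the sketch stays runnable without credentials.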
• Fast, scalable data warehouse
• Deploy a data warehouse in minutes
• Massively Parallel Query Execution
• Column-based Storage
• Redshift Spectrum extends data
warehouse to data lake
Amazon Redshift
Amazon Redshift Authentication
BI tools, SQL clients, Analytics tools
Redshift
ADFS
Corporate
Active Directory
IAM
Amazon Redshift
ODBC/JDBC
User groups Individual user
Single Sign-On
Identity providers
Amazon Redshift Data Protection
In-Transit
• Amazon Redshift API calls are made over HTTPS
• SSL certificate for each Amazon Redshift cluster & SSL
required for connectivity
• Set require_ssl=true
At-Rest
• Enable cluster encryption on your Amazon Redshift cluster
• Supports client-side encryption using customer managed key
• Supports server-side encryption using SSE-KMS & SSE-S3
Redshift Audit Logging
AWS CloudTrail
• For all Redshift API calls
• For all KMS API calls
• For all S3 API calls
Amazon Redshift (system tables and views)
• STL_DDLTEXT
• STL_QUERYTEXT
• STL_UTILITYTEXT
• SVL_STATEMENTTEXT
Amazon S3 (audit log files)
• Connection Log
• User Log
• User Activity Log (enable the enable_user_activity_logging database parameter)
Storage
Compute
Metadata/Catalog
Metadata/Catalog
• Central metadata repository
• Apache Hive metastore compatible
• Integrates with Hive, Spark, Presto,
Athena, and Redshift Spectrum
• Crawlers discover and classify data into
central searchable catalog
AWS Glue Catalog
Glue data catalog – Resource policies
• Fine-grained access control to Catalog using IAM policies
• Restrict what users can view and query
• Create centralized Data catalog
• Create a Data curator role in your organization
• Using IAM policies to control Catalog access – similar to S3 bucket policies
• Enable metadata encryption using AWS KMS
• Use IAM “Deny” for “glue:Delete*” operations
Glue Catalog Best practices
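The last best practice above, an explicit IAM "Deny" on "glue:Delete*", could look roughly like the policy built below. The helper name and Sid are hypothetical, and the wildcard resource is an assumption; a real policy would usually scope the resource to specific catalog ARNs and exempt the data curator role.

```python
import json


def build_glue_delete_deny_policy() -> dict:
    """Build an IAM policy that blocks all Glue delete operations."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyGlueCatalogDeletes",
                "Effect": "Deny",
                # Covers DeleteTable, DeleteDatabase, DeletePartition, etc.
                "Action": "glue:Delete*",
                "Resource": "*",
            }
        ],
    }


glue_policy_json = json.dumps(build_glue_delete_deny_policy(), indent=2)
```

Attached to analyst and developer roles, a statement like this keeps catalog metadata from being dropped by accident even when a broader Allow exists.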
Storage
Compute
Metadata/Catalog
Compute
• Hadoop, Spark, Presto, Hive, more
• Easy to use, fully managed
• Launch a cluster in minutes
• Baked in security features
• Pay by the hour and save with Spot
Amazon EMR
Master instance
group
Core instance group
HDFS HDFS HDFS
Task instance group
Amazon EMR
Amazon
S3
Amazon EMR network security
• Create Clusters in private VPC subnets
• S3 access via VPC API endpoint
• AWS API access via NAT gateway
• Locked down security groups
• AWS Direct Connect from on-prem
AWS Direct
Connect
Amazon
EMR cluster
Private VPC subnet
Public VPC subnet
VPC NAT
gateway
S3 VPC
endpoint
AWS APIs
Amazon
S3
EMR authentication
• Configure Kerberos for cluster authentication
• Perimeter security using Apache Knox
• Simplify authentication of various Hadoop services and UIs
• Mask service-specific URLs/ports by acting as a proxy
• Enable SSL termination at the perimeter
• Ease management of published endpoints across multiple clusters
Knox
Gateway
MIT Kerberos
{REST}
Amazon EMRFS
• Sits between Amazon EMR and
Amazon S3
• EMR clusters use EMRFS for
reading and writing files from
Amazon S3
• Provides consistent view and data
encryption
• Use different IAM roles for EMRFS
requests to Amazon S3
• These IAM roles can be cluster
users, groups, or the location of
EMRFS data in Amazon S3
EMRFS
Amazon
EMR
Amazon
S3
AD Users
Data
Scientist
Developer
Business
User
IAM Roles
Analyst
Role
Developer
Role
Business
User Role
EMRFS
Security
Configuration
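The mapping of users, groups, and S3 locations to IAM roles shown on this slide is expressed in an EMR security configuration. The sketch below builds such a document; the helper name, account ID, role names, and identifiers are all hypothetical, and the JSON field names follow the EMRFS role mapping format as best understood here, so confirm them against the EMR security configuration documentation.

```python
import json

# Identifier types EMRFS role mappings can key on.
VALID_IDENTIFIER_TYPES = {"User", "Group", "Prefix"}


def build_emrfs_role_mappings(mappings):
    """Build the authorization part of an EMR security configuration.

    mappings is a list of (role_arn, identifier_type, identifiers) tuples.
    """
    role_mappings = []
    for role_arn, identifier_type, identifiers in mappings:
        if identifier_type not in VALID_IDENTIFIER_TYPES:
            raise ValueError(f"unsupported identifier type: {identifier_type}")
        role_mappings.append({
            "Role": role_arn,
            "IdentifierType": identifier_type,
            "Identifiers": list(identifiers),
        })
    return {
        "AuthorizationConfiguration": {
            "EmrFsConfiguration": {"RoleMappings": role_mappings}
        }
    }


# Hypothetical account, roles, group, user, and bucket prefix.
emrfs_config = build_emrfs_role_mappings([
    ("arn:aws:iam::111122223333:role/AnalystRole", "Group", ["data-scientists"]),
    ("arn:aws:iam::111122223333:role/DeveloperRole", "User", ["developer1"]),
    ("arn:aws:iam::111122223333:role/BusinessUserRole", "Prefix",
     ["s3://EXAMPLE-BUCKET-NAME/reports/"]),
])
emrfs_config_json = json.dumps(emrfs_config, indent=2)
```

When EMRFS makes a request to S3, it would pick the role whose mapping matches the calling user, the user's group, or the S3 prefix being accessed, rather than the cluster's default instance role.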
EMR – Data Protection
Amazon
S3
Amazon
EMR
At-rest data encryption
• For cluster nodes (EC2 instance volumes)
• Open-source HDFS encryption
• LUKS encryption for local volumes
HDFS
(Block-transfer
and RPC)
Local volumes
(Instance
Store/EBS)
In-transit data encryption
• For distributed applications
• Open-source encryption
functionality
In-transit data encryption
• For EMRFS traffic between S3 and
cluster nodes (enabled automatically)
• TLS encryption
At-rest data encryption
• For EMRFS on S3
• Server-side or client-side encryption
• SSE-S3, SSE-KMS, CSE-KMS, or CSE-Custom
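The at-rest and in-transit options on this slide are typically switched on together through an EMR security configuration. The sketch below builds one that uses SSE-KMS for EMRFS data in S3, KMS-backed encryption for local volumes, and PEM certificates for in-transit encryption. The helper name, key ARN, and certificate location are hypothetical, and the field names reflect the security configuration JSON format as recalled here; verify them against the EMR documentation for your release.

```python
import json


def build_emr_encryption_config(kms_key_arn: str, tls_cert_s3_uri: str) -> dict:
    """Build the encryption part of an EMR security configuration."""
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": True,
            "EnableAtRestEncryption": True,
            "InTransitEncryptionConfiguration": {
                # Certificates used for in-transit encryption between nodes.
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": tls_cert_s3_uri,
                }
            },
            "AtRestEncryptionConfiguration": {
                # EMRFS data written to S3.
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                },
                # Instance store and EBS volumes on cluster nodes.
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                },
            },
        }
    }


# Hypothetical key ARN and certificate bundle location.
emr_encryption_json = json.dumps(
    build_emr_encryption_config(
        "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
        "s3://EXAMPLE-BUCKET-NAME/certs/my-certs.zip",
    ),
    indent=2,
)
```

Swapping "SSE-KMS" for one of the other modes listed on the slide (SSE-S3, CSE-KMS, CSE-Custom) changes which additional fields are required, so treat this as a starting point rather than a template.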
EMR Audit Logging
AWS CloudTrail
• For all EMR API calls
• For all KMS API calls
• For all S3 API calls
Amazon EMR (cluster log files)
• /JobFlowId/node/
• /JobFlowId/steps/N/
• /JobFlowId/containers
Data Lake Solutions in the Wild
FINRA oversees > 3,000
securities firms doing
business in the United States.
Challenge:
FINRA’s legacy system did not
scale well
• Up to 75 billion events per day
• Run complex surveillance queries
over 20+ PB of data
Solution:
• Migrated their big data appliance
to an S3 Data Lake and used EMR
for ingestion and processing
• Migrated to RDS and testing Aurora
FINRA uses S3 to Build Data Lake with EMR
• Required fast access
across trillions of trade
records (20PB+)
• Migrated from
on-premises system
• Use Apache HBase on
Amazon EMR to store
and serve this data
• Use EMR engines—
Spark, Presto, and Hive
to process data
• Lower costs by 60% over
on-premises system
Spark
on EMR
Presto
on EMR
Hive
on EMR
S3
Herd
Metastore
HBase
on EMR
Nasdaq operates financial
exchanges around the
world, and processes
large volumes of data.
Challenge:
Nasdaq wanted to make their large
historical data footprint available
to analyze as a single dataset.
Solution:
• Use Amazon Redshift for
interactive querying
• Use Amazon S3 as a Data Lake,
and Presto on EMR to process
historical data
Nasdaq Uses AWS to Build a Data Lake
• Migrate legacy
on-premises warehouse
to Amazon Redshift
• 4.8B rows inserted
per trading day
(orders, trades, quotes)
• Ingest data from multiple
sources, validate, and
stage in S3
• Redshift reads data out of
S3 for fast queries
• Presto on EMR and S3 used
for analysis of massive
historical data set
Data from all 7 exchanges
operated by Nasdaq
(orders, quotes, trade executions)
Flat
files
Operational
Databases
EMR
Redshift
S3
SQL
Clients
Zillow operates a portfolio of
the largest online real-estate
and home-related brands.
Challenge:
Needed to analyze 100M homes
with fast, massively parallel
machine-learning jobs.
Solution:
• Spark on Amazon EMR and
Amazon Kinesis power Zillow’s
site (personalization, advertising
optimization, and recommendations)
Zillow Uses AWS to Build Personalized Website
• Ingests public data
(property records, recent
sales) into Kinesis
• Spark on EMR performs
ML to provide real-time
home estimates
• Store data into S3
Data Lake
• Use data to do
personalization,
advertising optimization,
and recommendations
for website
Zestimates home
valuation estimates
Personalization, advertising
optimization, and
recommendations for website
Public-property records,
home tax assessments,
sales transactions, images,
video, MLS-listing data, and
user-provided data
Kinesis
S3
Spark
on EMR
Public
Data
Asurion is a leader in
providing customer support
and protection for smartphones,
tablets, consumer electronics,
and appliances.
Challenge:
Had a wide variety of relational (from
OLTP databases), and non-relational
data (from telephony, voice-to-text,
claims data, and social) from >290M
customers. Wanted to do analytics.
Solution:
• Use S3 as a data lake to store all
data in a single location
• Use EMR to process raw data
• Use Redshift for fast BI
Asurion Uses AWS for Data Lakes and Analytics
• Collect data from a variety
of sources: OLTP,
telephony, voice-to-text,
claims data, and social
• Real-time data is streamed
with Kinesis and Lambda
• Raw data lands in the S3 Data Lake
and is processed with EMR
• Redshift provides fast
analytics for BI tools
Data from
other
applications
Data from
OLTP
Application
Events &
Logging
Domain
applications
EMR
Data preparation: Process raw
data into meaningful content
Access
Virtualization
(AD and User
Profiles)
3rd party
BI Tools
Orchestration – Jobs, Plans, Workflows – Enterprise Scheduler – Information Lifecycle Management
Real-time data
Redshift
DynamoDB
S3
Kinesis
Data
Collection
Services
(ODBC/
JDBC, CDC,
Event
Streaming,
APIs)
Airbnb is a community
marketplace that allows
property owners and travelers
to connect with each other.
Challenge:
Data grows 3x every year, with PBs
of data stored. Used Hadoop/HDFS,
but experienced bottlenecks in
performance and high costs.
Solution:
• Created a tiered storage system:
Land hot data in HDFS, and all
warm/cold data in S3 data lake
• S3 provides virtually unlimited storage at
lower costs
Airbnb Uses AWS for data lake and analytics
• Land hot data in HDFS
• Warm/cold data in S3
• Brings the best of both—
performance, scalability, cost
• Analyze data with Hive,
Presto, Spark, etc.
Hive on EMR
HDFS Cluster
S3
Spark on EMR
Presto on EMR
Amazon.com’s vision is to be
the earth’s most customer-centric
company, where people
can find anything they want
to buy online.
Challenge:
Load 500K+ transactions each day, and
serve 300K+ queries/extracts each day
from Amazon businesses (Amazon.com,
Amazon Prime, Amazon Music, Amazon
Alexa, Amazon Video, and Twitch).
Solution:
• Land data in S3 as a data lake
• Use Redshift as the preferred SQL-based
analysis tool for business users, and EMR
for machine learning
Amazon.com Uses AWS for Data Lakes & Analytics
• DynamoDB capturing all
Amazon.com transactions
• Everything from
DynamoDB, RDS
PostgreSQL and Kinesis fed
to an S3 data lake
• Glue used to catalog
the data
• Redshift used for all SQL-
based queries, and EMR for
all machine learning and big
data processing
• End-users use QuickSight
for visualizations
AWS Glue
Catalog
QuickSight
S3 Athena
EMR
DynamoDB
PostgreSQL
Kinesis
Redshift
Machine
Learning
Thank you!
marloweh@amazon.com
aws.amazon.com/activate
Everything and Anything Startups
Need to Get Started on AWS

Using AWS CloudTrail to Enhance Governance and Compliance of Amazon S3 - DEV3...Using AWS CloudTrail to Enhance Governance and Compliance of Amazon S3 - DEV3...
Using AWS CloudTrail to Enhance Governance and Compliance of Amazon S3 - DEV3...
 
Building Hybrid Cloud Storage Architectures with AWS
Building Hybrid Cloud Storage Architectures with AWSBuilding Hybrid Cloud Storage Architectures with AWS
Building Hybrid Cloud Storage Architectures with AWS
 
Building Hybrid Cloud Storage Architectures with AWS @scale
Building Hybrid Cloud Storage Architectures with AWS @scaleBuilding Hybrid Cloud Storage Architectures with AWS @scale
Building Hybrid Cloud Storage Architectures with AWS @scale
 
Getting Started with AWS Security
Getting Started with AWS SecurityGetting Started with AWS Security
Getting Started with AWS Security
 
AWS Storage Stage of Union
AWS Storage Stage of UnionAWS Storage Stage of Union
AWS Storage Stage of Union
 
Using Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter DachnowiczUsing Search with a Database - Peter Dachnowicz
Using Search with a Database - Peter Dachnowicz
 
Amazon S3 & Amazon Glacier - Object Storage Overview
Amazon S3 & Amazon Glacier - Object Storage OverviewAmazon S3 & Amazon Glacier - Object Storage Overview
Amazon S3 & Amazon Glacier - Object Storage Overview
 
Migrating your IT - AWS Summit Cape Town 2018
Migrating your IT - AWS Summit Cape Town 2018Migrating your IT - AWS Summit Cape Town 2018
Migrating your IT - AWS Summit Cape Town 2018
 
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...
Building Serverless Analytics Solutions with Amazon QuickSight (ANT391) - AWS...
 
Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...
Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...
Storage Data Management: Tools and Templates to Seamlessly Automate and Optim...
 

Mais de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Securing Your Big Data on AWS

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved Securing Your Big Data on AWS Hannah Marlowe, PhD Professional Services Consultant
  • 2. What to expect from this session? • Current data challenges • Data Lake approach • Encryption in transit and at rest • Components of a Data Lake • Security best practices • Storage • Metadata/Catalog • Compute • Customer examples
  • 3. Data challenge today • Quickly ingest and store any type of data, at any scale, and at low cost • Have a single source of truth and quickly search and find the relevant data • Easily query the data through a unified set of tools
  • 4. For Data to Be a Differentiator, Customers Need to Be Able to… • Capture and store new non-relational data at PB-EB scale in real time • Use new types of analytics that go beyond batch reporting to incorporate real-time, predictive, voice, and image recognition • Democratize access to data in a secure and governed way • Diagram labels: New types of data; New types of analytics (Dashboards, Predictive, Image Recognition, Voice, Real-time)
  • 5. Traditionally, Analytics Used to Look Like This • Relational data • TBs–PBs scale • Schema defined prior to data load • Operational reporting and ad hoc • Large initial CAPEX + $10K–$50K/TB/Year • Diagram labels: OLTP, ERP, CRM, LOB; Data Warehouse; Business Intelligence
  • 6. Data Lakes Extend the Traditional Approach • Relational and non-relational data • TBs–EBs scale • Diverse analytical engines • Low-cost storage & analytics • Help locate, curate, and secure your data • Provide democratized access to data within your organization • Diagram labels: OLTP, ERP, CRM, LOB; Devices, Web, Sensors, Social; Data Warehouse; Business Intelligence; Data Lake; Big Data processing, real-time, Machine Learning
  • 7. Building a Data Lake on AWS • Diagram labels: Athena Query Service, Batch, Glue, IoT, Lambda, SageMaker
  • 8. The primary components of a Data Lake • Architectural Layers of a Data Lake (without security): Storage, Compute, Metadata/Catalog • Object storage – S3/Glacier • Block storage – EBS • File storage – EFS • Attached instance store • EC2 instance • Redshift clusters
  • 9. The primary components of a Data Lake • Architectural Layers of a Data Lake (without security): Storage, Compute, Metadata/Catalog • Automatically index data • Easy search with tags/business domain • Curate and assign relevancy score • Easily commission and decommission data sets • Capture data lineage
  • 10. The primary components of a Data Lake • Architectural Layers of a Data Lake (without security): Storage, Compute, Metadata/Catalog • Server-based compute • More than just standalone EC2, also includes EMR, Redshift • Serverless compute (Lambda, Athena, API Gateway, etc.) • Hybrid • Redshift Spectrum
  • 11. Ubiquitous Encryption
  • 12. Encrypt data in transit • Protect data flows (Point “A” → Point “B”: data flow protection) • Enterprise data sources → Amazon S3: encrypted with SSL/TLS; S3 requests signed with AWS SigV4 • Amazon S3 → Amazon EMR: encrypted with SSL/TLS • Amazon S3 → Amazon Redshift: encrypted with SSL/TLS • Amazon EMR → Clients: encrypted with SSL/TLS; varies with Hadoop application client • Amazon Redshift → Clients: supports SSL/TLS; requires configuration • Apache Hadoop on Amazon EMR: Hadoop RPC encryption; HDFS block data transfer encryption; KMS over HTTPS is not enabled by default with Hadoop KMS; may vary with EMR release (such as Tez and Spark in release 5.0.0+)
  • 13. Options for data-at-rest encryption in AWS • Client-side encryption: you encrypt your data before submitting it to an AWS service; you supply encryption keys OR use keys in AWS Key Management Service under your control; tools: AWS Encryption SDK, S3 Encryption Client, EMRFS Client, DynamoDB Encryption Client • Server-side encryption: AWS encrypts data on your behalf after it is received by the service; 47 services including Amazon S3, Amazon EBS, Amazon RDS, Amazon Redshift, Amazon WorkSpaces, Amazon Kinesis Streams, AWS CloudTrail…; integrated with AWS Key Management Service so that you control the keys
  • 14. Server-Side Encryption in AWS • Two-tiered key hierarchy using envelope encryption: a unique data key encrypts customer data; Customer Master Keys encrypt data keys • Benefits: limits risk of a compromised data key; better performance for encrypting large data; easier to manage a small number of master keys than billions of data keys; centralized access and audit of key activity • Diagram labels: KMS; Customer master keys; Data key 1, Data key 2, Data key 3, Data key 4; S3 object, EBS volume, Amazon Redshift cluster, Custom application
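The two-tiered key hierarchy on slide 14 can be sketched in a few lines of Python. This is an illustration of the structure only, not how AWS KMS is implemented: the function names are hypothetical, the master key is a local byte string rather than a KMS-held key, and the cipher is a toy SHA-256 keystream standing in for a real authenticated cipher such as AES-GCM. It must not be used to protect real data.

```python
import hashlib
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with SHA-256(key || nonce || counter) blocks.

    Stand-in for a real cipher (e.g. AES-GCM). NOT secure; illustration only.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + nonce + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


def envelope_encrypt(master_key: bytes, plaintext: bytes) -> dict:
    """Encrypt plaintext under a fresh data key, then wrap that data key."""
    data_key = secrets.token_bytes(32)   # unique data key per object
    data_nonce = secrets.token_bytes(16)
    key_nonce = secrets.token_bytes(16)
    return {
        # Tier 1: the data key encrypts the customer data.
        "ciphertext": _keystream_xor(data_key, data_nonce, plaintext),
        "data_nonce": data_nonce,
        # Tier 2: the master key encrypts (wraps) the data key.
        "encrypted_data_key": _keystream_xor(master_key, key_nonce, data_key),
        "key_nonce": key_nonce,
    }


def envelope_decrypt(master_key: bytes, envelope: dict) -> bytes:
    """Unwrap the data key with the master key, then decrypt the data."""
    data_key = _keystream_xor(
        master_key, envelope["key_nonce"], envelope["encrypted_data_key"]
    )
    return _keystream_xor(data_key, envelope["data_nonce"], envelope["ciphertext"])
```

Only the wrapped data key is stored next to the ciphertext, so a leaked data key exposes one object, and access to the small set of master keys can be controlled and audited centrally, which is the benefit the slide lists.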
  • 15. Storage (section divider: Storage, Compute, Metadata/Catalog)
  • 16. Amazon S3 • Scalable, highly available storage • Designed for 11 9’s of durability • Central Data Lake source of truth • Store anything at any scale • Query in place with Athena and Redshift Spectrum
  • 17. Secure S3 – preemptive controls • Create buckets based on business domains • Assign bucket policies – Restrict by VPC, SSL, IP filters, KMS keys • Restrict using Tags • Enable encryption • Enable versioning • MFA delete • Enable backups – across accounts/regions
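As a rough illustration of the bucket-policy restrictions on slide 17, the sketch below builds a policy document with three Deny statements: requests without SSL/TLS, requests that do not arrive through a given VPC endpoint, and uploads that do not request SSE-KMS. The function name, bucket name, and endpoint ID are hypothetical; the condition keys (`aws:SecureTransport`, `aws:sourceVpce`, `s3:x-amz-server-side-encryption`) are standard IAM/S3 keys, but the statement set is an assumption about what such a policy might contain, not the presenter's policy.

```python
import json


def build_bucket_policy(bucket: str, vpc_endpoint_id: str) -> dict:
    """Return a bucket policy that denies insecure or out-of-VPC access."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    object_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Restrict by SSL: reject any request not sent over TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, object_arn],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {   # Restrict by VPC: reject requests from outside the endpoint.
                "Sid": "DenyOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, object_arn],
                "Condition": {
                    "StringNotEquals": {"aws:sourceVpce": vpc_endpoint_id}
                },
            },
            {   # Enable encryption: reject uploads that do not ask for SSE-KMS.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": object_arn,
                "Condition": {
                    "StringNotEquals": {
                        "s3:x-amz-server-side-encryption": "aws:kms"
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket name and VPC endpoint ID.
    print(json.dumps(build_bucket_policy("example-bucket", "vpce-1a2b3c4d"), indent=2))
```

A Deny on everything outside one VPC endpoint also blocks console and administrative access from elsewhere, so a policy like this is usually tested on a non-production bucket first.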
  • 18. Manage permissions with tags { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "s3:GetObject" ], "Resource": "arn:aws:s3:::EXAMPLE-BUCKET-NAME/*", "Condition": {"StringEquals": {"S3:ResourceTag/HIPAA":"True"}} } ] }
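The tag-based policy on slide 18 can also be generated programmatically, which keeps the JSON well-formed. The sketch below is one possible shape; the function name and its arguments are hypothetical. It assumes the `s3:ExistingObjectTag/<key>` condition key, which is the key S3 documents for matching tags on existing objects; the slide's own policy writes the key as `S3:ResourceTag/HIPAA`, so check the current IAM documentation before relying on either form.

```python
import json


def tag_scoped_read_policy(bucket: str, tag_key: str, tag_value: str) -> dict:
    """Allow s3:GetObject only on objects carrying the given tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    # Assumed condition key for object tags; see lead-in.
                    "StringEquals": {
                        f"s3:ExistingObjectTag/{tag_key}": tag_value
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    policy = tag_scoped_read_policy("EXAMPLE-BUCKET-NAME", "HIPAA", "True")
    # json.dumps guarantees valid JSON, so a missing comma cannot slip in.
    print(json.dumps(policy, indent=2))
```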
  • 19. Secure S3 data – detective controls • Enable AWS Config to detect S3 bucket-level changes – s3-blacklisted-actions-prohibited, s3-bucket-policy-not-more-permissive, s3-bucket-logging-enabled, s3-bucket-public-read-prohibited, s3-bucket-public-write-prohibited, s3-bucket-replication-enabled, s3-bucket-server-side-encryption-enabled, s3-bucket-ssl-requests-only, s3-bucket-versioning-enabled • Audit S3 data access using CloudTrail – Log to separate CloudWatch Logs – Kerberos-enabled EMR clusters allow you to track the AD user • Use Amazon GuardDuty to detect unauthorized and unexpected activity • Enable Amazon Macie to classify sensitive data
  • 20. Amazon Redshift • Fast, scalable data warehouse • Deploy a data warehouse in minutes • Massively Parallel Query Execution • Column-based Storage • Redshift Spectrum extends data warehouse to data lake
  • 21. Amazon Redshift Authentication • Diagram labels: BI tools, SQL clients, Analytics tools; Amazon Redshift ODBC/JDBC; Redshift; ADFS; Corporate Active Directory; IAM; User groups; Individual user; Single Sign-On; Identity providers
  • 22. Amazon Redshift Data Protection • In-Transit: Amazon Redshift API calls are made over HTTPS; SSL certificate for each Amazon Redshift cluster & SSL required for connectivity; set require_ssl=true • At-Rest: enable cluster encryption on your Amazon Redshift cluster; supports client-side encryption using customer managed key; supports server-side encryption using SSE-KMS & SSE-S3
  • 23. Redshift Audit Logging • For all Redshift API calls • For all KMS API calls • For all S3 API calls • STL_DDLTEXT • STL_QUERYTEXT • STL_UTILITYTEXT • STL_STATEMENTTEXT • Connection Log • User Log • User Activity Log • Enable the enable_user_activity_logging database parameter • Diagram labels: AWS CloudTrail, Amazon Redshift, Amazon S3
  • 24. Metadata/Catalog (section divider: Storage, Compute, Metadata/Catalog)
  • 25. AWS Glue Catalog • Central metadata repository • Apache Hive metastore compatible • Integrates with Hive, Spark, Presto, Athena and Redshift Spectrum • Crawlers discover and classify data into central searchable catalog
  • 26. Glue data catalog – Resource policies • Fine-grained access control to Catalog using IAM policies • Restrict what they can view and query
  • 27. Glue Catalog Best practices • Create centralized Data catalog • Create a Data curator role in your organization • Use IAM policies to control Catalog access – similar to S3 bucket policies • Enable metadata encryption using AWS KMS • Use IAM “Deny” for “glue:Delete*” operations
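The last best practice on slide 27, an IAM “Deny” for “glue:Delete*”, could look roughly like the policy built below. The function name is hypothetical and the read-only Allow statement is an assumption added for context; only the Deny on `glue:Delete*` comes from the slide. Because an explicit Deny overrides any Allow in IAM, attaching this to analyst roles keeps catalog entries from being removed even if another policy grants broader Glue access.

```python
import json


def glue_catalog_guardrail_policy() -> dict:
    """Return an IAM policy that permits catalog reads but blocks deletes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Assumed read-only access for catalog consumers.
                "Sid": "AllowCatalogReads",
                "Effect": "Allow",
                "Action": ["glue:Get*", "glue:BatchGet*"],
                "Resource": "*",
            },
            {   # From the slide: explicitly deny every Glue delete operation.
                "Sid": "DenyCatalogDeletes",
                "Effect": "Deny",
                "Action": "glue:Delete*",
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(glue_catalog_guardrail_policy(), indent=2))
```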
  • 28. Compute (section divider: Storage, Compute, Metadata/Catalog)
  • 29. Amazon EMR • Hadoop, Spark, Presto, Hive, more • Easy to use, fully managed • Launch a cluster in minutes • Baked-in security features • Pay by the hour and save with Spot
  • 30. Amazon EMR • Diagram labels: Master instance group; Core instance group (HDFS, HDFS, HDFS); Task instance group; Amazon S3
  • 31. Amazon EMR network security • Create clusters in private VPC subnets • S3 access via VPC API endpoint • AWS API access via NAT gateway • Locked-down security groups • AWS Direct Connect from on-prem • Diagram labels: AWS Direct Connect; VPC; Private VPC subnet (Amazon EMR cluster); Public VPC subnet (NAT gateway); S3 VPC endpoint; AWS APIs; Amazon S3
  • 32. EMR authentication • Configure Kerberos for cluster authentication • Perimeter security using Apache Knox • Simplify authentication of various Hadoop services and UIs • Mask service-specific URLs/ports by acting as a proxy • Enable SSL termination at the perimeter • Ease management of published endpoints across multiple clusters • Diagram labels: Knox Gateway, MIT Kerberos, {REST}
  • 33. Amazon EMRFS • Sits between Amazon EMR and Amazon S3 • EMR clusters use EMRFS for reading and writing files from Amazon S3 • Provides consistent view and data encryption • Use different IAM roles for EMRFS requests to Amazon S3 • These IAM roles can be mapped to cluster users, groups, or the location of EMRFS data in Amazon S3 • Diagram labels: AD Users (Data Scientist, Developer, Business User); IAM Roles (Analyst Role, Developer Role, Business User Role); EMRFS Security Configuration; EMRFS; Amazon EMR; Amazon S3
  • 34. EMR – Data Protection • At-rest data encryption for cluster nodes (EC2 instance volumes): open-source HDFS encryption; LUKS encryption for local volumes • In-transit data encryption for distributed applications: open-source encryption functionality • In-transit data encryption for EMRFS traffic between S3 and cluster nodes (enabled automatically): TLS encryption • At-rest data encryption for EMRFS on S3: server-side or client-side encryption; SSE-S3, SSE-KMS, CSE-KMS, or CSE-Custom • Diagram labels: Amazon S3; Amazon EMR; HDFS (Block-transfer and RPC); Local volumes (Instance Store/EBS)
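The four encryption areas on slide 34 are normally switched on together through an EMR security configuration, which is a JSON document. The sketch below assembles one in Python. The function name, the KMS key ARN, and the certificate location are hypothetical, and the JSON field names follow the EMR security-configuration format as best recalled here, so they should be checked against the current EMR documentation. CSE-Custom is left out because it needs a custom key provider class.

```python
import json

# S3 encryption modes from the slide that this sketch handles.
_SUPPORTED_S3_MODES = {"SSE-S3", "SSE-KMS", "CSE-KMS"}


def emr_security_configuration(
    kms_key_arn: str, tls_cert_s3_uri: str, s3_mode: str = "SSE-KMS"
) -> dict:
    """Build an EMR security configuration covering at-rest and in-transit."""
    if s3_mode not in _SUPPORTED_S3_MODES:
        raise ValueError(f"unsupported S3 encryption mode for this sketch: {s3_mode}")

    # At rest, EMRFS data on S3. SSE-S3 uses S3-managed keys, so no KMS key.
    s3_encryption = {"EncryptionMode": s3_mode}
    if s3_mode != "SSE-S3":
        s3_encryption["AwsKmsKey"] = kms_key_arn

    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": True,
            "EnableAtRestEncryption": True,
            # In transit: certificates used for TLS between cluster nodes.
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    "S3Object": tls_cert_s3_uri,
                }
            },
            "AtRestEncryptionConfiguration": {
                "S3EncryptionConfiguration": s3_encryption,
                # At rest, local volumes (instance store/EBS) on cluster nodes.
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                },
            },
        }
    }


if __name__ == "__main__":
    config = emr_security_configuration(
        "arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",  # hypothetical
        "s3://example-config-bucket/certs/my-certs.zip",          # hypothetical
    )
    print(json.dumps(config, indent=2))
```

The resulting JSON is what would be supplied when creating a security configuration, which is then referenced by name at cluster launch.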
  • 35. EMR Audit Logging • For all EMR API calls • For all KMS API calls • For all S3 API calls • /JobFlowId/node/ • /JobFlowId/steps/N/ • /JobFlowId/containers • Diagram labels: AWS CloudTrail, Amazon EMR
  • 36. Data Lake Solutions in the Wild
  • 37. FINRA oversees > 3,000 securities firms doing business in the United States. Challenge: FINRA’s legacy system did not scale well • Up to 75 billion events per day • Run complex surveillance queries over 20+ PB of data Solution: • Migrated their big data appliance to an S3 Data Lake and used EMR for ingestion and processing • Migrated to RDS and testing Aurora
  • 38. FINRA uses S3 to Build Data Lake with EMR • Required fast access across trillions of trade records (20PB+) • Migrated from on-premises system • Use Apache HBase on Amazon EMR to store and serve this data • Use EMR engines (Spark, Presto, and Hive) to process data • Lower costs by 60% over on-premises system • Diagram labels: Spark on EMR, Presto on EMR, Hive on EMR, HBase on EMR, S3, Herd Metastore
  • 39. Nasdaq operates financial exchanges around the world, and processes large volumes of data. Challenge: Nasdaq wanted to make their large historical data footprint available to analyze as a single dataset. Solution: • Use Amazon Redshift for interactive querying • Use Amazon S3 as a Data Lake, and Presto on EMR to process historical data
  • 40. Nasdaq Uses AWS to Build a Data Lake • Migrate legacy on-premises warehouse to Amazon Redshift • 4.8B rows inserted per trading day (orders, trades, quotes) • Ingests data from multiple sources, validates it, and stages it in S3 • Redshift reads data out of S3 for fast queries • Presto on EMR and S3 used for analysis of massive historical data set • Diagram labels: Data from all 7 exchanges operated by Nasdaq (orders, quotes, trade executions); Flat files; Operational Databases; EMR; Redshift; S3; SQL Clients
  • 41. Zillow operates a portfolio of the largest online real-estate and home-related brands. Challenge: Needed to analyze 100M homes with fast, massively parallel machine-learning jobs. Solution: • Spark on Amazon EMR and Amazon Kinesis power Zillow’s site (personalization, advertising optimization, and recommendations)
  • 42. Zillow Uses AWS to Build Personalized Website • Ingests public data (property records, recent sales) into Kinesis • Spark on EMR performs ML to provide real-time home estimates • Store data in S3 Data Lake • Use data to do personalization, advertising optimization, and recommendations for website • Diagram labels: Public Data (public-property records, home tax assessments, sales transactions, images, video, MLS-listing data, and user-provided data); Kinesis; S3; Spark on EMR; Zestimates home valuation estimates; Personalization, advertising optimization, and recommendations for website
  • 43. Asurion is a leader in providing customer support and protection for smartphones, tablets, consumer electronics, and appliances. Challenge: Had a wide variety of relational data (from OLTP databases) and non-relational data (from telephony, voice-to-text, claims data, and social) from >290M customers. Wanted to do analytics. Solution: • Use S3 as a data lake to store all data in a single location • Use EMR to process raw data • Use Redshift for fast BI
  • 44. Asurion Uses AWS for Data Lakes and Analytics • Collect data from a variety of sources – OLTP, telephony, voice-to-text, claims data, and social • Real-time data is streamed with Kinesis and Lambda • Data lands in the S3 Data Lake and is processed with EMR • Redshift provides fast analytics for BI tools • Diagram labels: Data from other applications; Data from OLTP; Application Events & Logging; Domain applications; Real-time data; Data Collection Services (ODBC/JDBC, CDC, Event Streaming, APIs); Kinesis; S3; EMR (Data preparation: process raw data into meaningful content); Redshift; DynamoDB; Access Virtualization (AD and User Profiles); 3rd party BI Tools; Orchestration – Jobs, Plans, Workflows – Enterprise Scheduler – Information Lifecycle Management
  • 45. Airbnb is a community marketplace that allows property owners and travelers to connect with each other. Challenge: Data grows 3x every year, with PBs of data stored. Used Hadoop/HDFS, but experienced bottlenecks in performance and high costs. Solution: • Created a tiered storage system: Land hot data in HDFS, and all warm/cold data in S3 data lake • S3 provides infinite storage at lower costs
  • 46. Airbnb Uses AWS for data lake and analytics • Land hot data in HDFS • Warm/cold data in S3 • Brings the best of both: performance, scalability, cost • Analyze data with Hive, Presto, Spark, etc. • Diagram labels: Hive on EMR, Spark on EMR, Presto on EMR, HDFS Cluster, S3
  • 47. Amazon.com’s vision is to be the earth’s most customer-centric company, where people can find anything they want to buy online. Challenge: Load 500K+ transactions each day, and serve 300K+ queries/extracts each day from Amazon businesses (Amazon.com, Amazon Prime, Amazon Music, Amazon Alexa, Amazon Video, and Twitch). Solution: • Land data in S3 as a data lake • Use Redshift as the preferred SQL-based analysis tool for business users, and EMR for machine learning
  • 48. Amazon.com Uses AWS for Data Lakes & Analytics • DynamoDB captures all Amazon.com transactions • Everything from DynamoDB, RDS PostgreSQL and Kinesis is fed to an S3 data lake • Glue used to catalog the data • Redshift used for all SQL-based queries, and EMR for all machine learning and big data processing • End-users use QuickSight for visualizations • Diagram labels: DynamoDB, PostgreSQL, Kinesis, S3, AWS Glue Catalog, Athena, EMR, Redshift, Machine Learning, QuickSight
  • 49. Thank you! marloweh@amazon.com
  • 50. aws.amazon.com/activate Everything and Anything Startups Need to Get Started on AWS