© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Giorgio Nobile, Solutions Architect
April 24, 2018
Data Lake: data analysis @ scale
Data Lake Challenges
Data variety and data volumes are increasing rapidly
Multiple Consumers and Applications
Ingest, Discover, Catalog, Understand, and Curate all kinds of data
Quickly drive new insights
Purpose-built engines.
Right tool for the right job.
Customer Needs Come First
Amazon’s Purpose-Built Analytics Offerings
Collect Store Analyze
Amazon Kinesis Firehose
AWS Direct Connect
Amazon Snowball
Amazon Kinesis Analytics
Amazon Kinesis Streams
Amazon S3
Amazon Glacier
Amazon CloudSearch
Amazon RDS, Amazon Aurora
Amazon DynamoDB
Amazon Elasticsearch
Amazon EMR
Amazon Redshift
Amazon QuickSight
AWS Database Migration Service
AWS Glue
Amazon Athena
Amazon AI
Traditional Data Warehouse
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Relational data
Terabytes to Petabytes scale
Schema defined prior to data load
Operational reporting and ad hoc analysis
Data Lakes extend traditional warehouses
Relational and non-relational data
Terabytes to Exabytes scale
Schema defined during analysis
(Schema on Read)
Diverse analytical engines to gain insights
Designed for low cost storage and analytics
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
Data Lake
Devices Web Sensors Social
Data Catalog
Machine Learning
DW Queries
Big data processing
Interactive
Real-time
Data Lakes on AWS
Snowball
Snowmobile
Kinesis Data Firehose
Kinesis Data Streams
Amazon S3
AWS Glue
Wide variety of ways to bring data in
Durability and availability at Exabyte scale
Security, compliance, and audit capabilities
Run any analytics on the same data without movement
Scale storage and compute independently
Store at $0.023 / GB-month
Query for $0.05 / GB scanned
Redshift
EMR
Athena
Kinesis
Elasticsearch Service
Data Lake Architecture
Catalog & Search: Access and search metadata (DynamoDB, Elasticsearch)
Access & User Interface: Give your users easy and secure access (API Gateway, Identity & Access Management, Cognito)
Processing & Analytics: Use of predictive and prescriptive analytics to gain better understanding (QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS)
Central Storage: Secure, cost-effective storage in Amazon S3 (S3)
Data Ingestion: Get your data into S3 quickly and securely (Snowball, Database Migration Service, Kinesis Firehose, Direct Connect)
Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified (Security Token Service, CloudWatch, CloudTrail, Key Management Service)
Glue ETL
Serverless Analytics
Deliver cost-effective analytic solutions faster
S3 Data Lake
Glue (Data Catalog and ETL)
Redshift Spectrum
QuickSight
Serverless
Zero infrastructure
Zero administration
Pay only for what you use, not for idle resources
Availability and fault tolerance built in
Automatically scales resources with usage
Snowball
Snowmobile
Kinesis Data Firehose
many other sources
Other BI Tools
Amazon Athena
Amazon EMR
Storing is not enough, data needs to be discoverable
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner
CRM
ERP
Data warehouse
Mainframe data
Web
Social
Log files
Machine data
Semi-structured
Unstructured
AWS Glue—Serverless Data Catalog & ETL
Data Catalog: Discover data and extract schema
ETL: Auto-generate customizable code in Python and Spark
Automatically discovers data and stores schema
Data is immediately searchable and available for ETL
Generates customizable code
Schedules and runs your ETL jobs
Serverless
Crawlers: Automatic Schema Inference
enumerate S3 objects (file 1, file 2, … file N)
identify file type and parse files
custom classifiers: Grok based parser
built-in classifiers: JSON parser, CSV parser, Parquet parser, …
semi-structured per-file schema
semi-structured unified schema
[Figure: per-file schema trees built from int, char, bool, array, and struct nodes are merged into one unified schema tree.]
What can Crawlers Classify?
IAM Role
Glue Crawler
Data Lakes
Data Warehouse
Databases
Amazon RDS
Amazon Redshift
Amazon S3
JDBC Connection
Object Connection
Built-In Classifiers
MySQL
MariaDB
PostgreSQL
Aurora
Redshift
Avro
Parquet
ORC
XML
JSON & BSON
Logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others)
Delimited (comma, pipe, tab, semicolon)
Compressions (ZIP, BZIP, GZIP, LZ4, Snappy)
Detecting Schema Similarity
Schema A
root
  name: str
  id: num
  addr
    street: str
    city: str
    zip: num
Schema B
root
  name: str
  id: num
  addr: str
Schema similarity heuristic
• 1 point for matching name
• 1 point for matching data type
• Match when similarity index > 0.7
sim = intersection / min(A, B) = 7 / 8 = 0.875
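One way to read the heuristic above is sketched below in Python. This is an interpretation, not Glue's actual implementation: it assumes every node in the schema tree (including root) can earn two points, that names are matched by their path from root, and that a nested node has type "struct". Under those assumptions the two schemas on the slide score 7 / 8.

```python
def flatten(schema, path="root"):
    """Map each node path to its type; nested dicts are treated as structs."""
    nodes = {path: "struct"}
    for name, value in schema.items():
        child = f"{path}.{name}"
        if isinstance(value, dict):
            nodes.update(flatten(value, child))
        else:
            nodes[child] = value
    return nodes


def similarity(schema_a, schema_b):
    """Shared points divided by the smaller schema's maximum points."""
    a, b = flatten(schema_a), flatten(schema_b)
    shared = 0
    for path in a.keys() & b.keys():
        shared += 1                  # 1 point for matching name
        if a[path] == b[path]:
            shared += 1              # 1 point for matching data type
    # Assumption: each node is worth at most 2 points (name + type).
    return shared / min(2 * len(a), 2 * len(b))


# The two schemas shown on the slide.
SCHEMA_A = {"name": "str", "id": "num",
            "addr": {"street": "str", "city": "str", "zip": "num"}}
SCHEMA_B = {"name": "str", "id": "num", "addr": "str"}

sim = similarity(SCHEMA_A, SCHEMA_B)
print(sim, "match" if sim > 0.7 else "no match")
```

Here `addr` earns only its name point because its type differs (struct in A, str in B), which is what brings the intersection to 7.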
Custom Classifiers
1. Write a custom classifier
2. Add it to your crawler
Available partitions
Automatically Detect Partitions
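As a rough illustration of partition detection, the Python sketch below derives partition columns from Hive-style `name=value` folders in S3 object keys. The keys and the `partition_values` helper are hypothetical, and the slide does not say how the crawler does this internally; the sketch only shows the idea of reading partitions from the folder layout.

```python
def partition_values(key):
    """Return {column: value} for Hive-style `name=value` folders in a key."""
    parts = {}
    for segment in key.split("/")[:-1]:      # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name:
            parts[name] = value
    return parts


# Hypothetical object keys under one table prefix.
keys = [
    "trips/year=2018/month=03/part-0000.parquet",
    "trips/year=2018/month=04/part-0000.parquet",
    "trips/year=2018/month=04/part-0001.parquet",
]

# Distinct partitions, analogous to the "available partitions" list.
available = sorted({tuple(partition_values(k).items()) for k in keys})
for partition in available:
    print(dict(partition))
```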
Automatically update table version as data evolves
Automatic Schema Versioning
Other Ways of Creating Tables
Call Glue’s CreateTable API
Create table manually
Run Hive DDL statement
Import from Apache Hive Metastore
Apache Hive Metastore
AWS Glue ETL
AWS Glue Data Catalog
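For the CreateTable route, a minimal Python sketch using boto3 might look like the following. The table name, bucket, and columns are placeholders, and the Parquet input/output format and SerDe class names are the standard Hive ones; the actual API call needs boto3 and AWS credentials, so it is wrapped in a function that is not invoked here.

```python
def parquet_table_input(name, location, columns, partition_keys=()):
    """Build the TableInput structure passed to Glue's CreateTable API."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def create_table(database, table_input):
    """Register the table in the Data Catalog (requires boto3 and credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    glue = boto3.client("glue")
    return glue.create_table(DatabaseName=database, TableInput=table_input)


# Placeholder names; nothing is sent to AWS by building the dictionary.
table_input = parquet_table_input(
    name="trips",
    location="s3://example-bucket/trips/",
    columns=[("trip_id", "bigint"), ("fare", "double")],
    partition_keys=[("year", "string")],
)
print(table_input["StorageDescriptor"]["Location"])
```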
Amazon Redshift—Data Warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at any scale: Columnar storage technology to improve I/O efficiency and scale query performance
Inexpensive: As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
Open file formats: Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Secure: Audit everything; encrypt data end-to-end; extensive certification and compliance
New Dense Compute Node - DC2
2X Performance @ Same Price as DC1
3x more I/O with 30% better storage utilization than DC1
“We saw a 9x reduction in month-end reporting time
with Redshift DC2 nodes as compared to DC1”
- Bradley Todd,
Technical Architect, Liberty Mutual
NVMe SSD DDR4 Memory Intel E5-2686 v4 (Broadwell)
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in an S3 data lake
Exabyte Redshift SQL queries against S3
Join data across Redshift and S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
CSV, ORC, Grok, Avro, & Parquet data formats
Pay only for the amount of data scanned
S3 data lake
Redshift data
Redshift Spectrum query engine
Amazon Redshift Spectrum Architecture
Massively parallel, shared nothing columnar architecture
Leader node
• SQL endpoint
• Stores metadata
• Coordinates parallel SQL processing
Compute nodes
• Local, columnar storage
• Executes queries in parallel
• Load, unload, backup, restore
Amazon Redshift Spectrum nodes
• Execute queries directly against Amazon S3
[Diagram labels: SQL Clients/BI Tools; JDBC/ODBC; Leader Node; Compute Nodes; node boxes each labeled 128GB RAM, 16TB disk, 16 cores; Amazon Redshift Spectrum nodes 1, 2, 3, 4 ... N; Amazon S3; Load, Unload, Backup, Restore]
Redshift Spectrum
Query your data lake
Amazon Redshift
JDBC/ODBC
Redshift Spectrum: Scale-out serverless compute (nodes 1, 2, 3, 4 ... N)
Amazon S3: Exabyte-scale object storage
AWS Glue Data Catalog
Query
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY …
Defining Data Lake Artifacts (Schema on Read)
Define an external schema in Amazon Redshift using the Data Catalog
CREATE external schema archived_trips
from data catalog database 'sampledb'
iam_role 'arn:aws:iam::123456789012:role/MySpectrumRole'
region 'us-east-2';
View External Schemas
select * from svv_external_schemas;
View External Tables
select * from svv_external_tables;
Data Lake on Amazon S3 with AWS Glue
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
Amazon QuickSight
AWS Glue ETL
Data Lake Analytics with Redshift Spectrum
AWS Glue
Amazon S3
Data sources
Amazon Redshift
Redshift Spectrum
BI Tools
“Amazon Redshift Spectrum is a game changer for us. Reports that took minutes to produce are now delivered in seconds. We like the ability to scale compute on-demand to query Petabytes of data in S3 in various open file formats.”
-- Rafi Ton, CEO, NUVIAD
NUVIAD is a marketing platform that helps media buyers optimize their mobile bidding
Use AWS for marketing campaign and bidding analytics
Scale S3 storage for unlimited data capacity
Use Spectrum for unlimited scale and query concurrency
80% performance gain using Parquet data format
Demonstration
Redshift IAM Callout
Amazon Redshift needs authorization to access the Data Catalog in AWS Glue and the data files in Amazon S3 on your behalf.
To provide that authorization, you first create an AWS Identity and Access Management (IAM) role.
Then you attach the role to your cluster and provide the Amazon Resource Name (ARN) for the role in the Amazon Redshift CREATE EXTERNAL SCHEMA statement.
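The two pieces described above can be sketched in Python as plain data: a trust policy that lets the Redshift service assume the role, and the CREATE EXTERNAL SCHEMA statement that names the role's ARN. The helper names are hypothetical, the schema, database, ARN, and region are the sample values from the earlier slide, and the permissions policy for Glue and S3 access is deliberately left out.

```python
import json


def redshift_trust_policy():
    """Trust policy allowing the Amazon Redshift service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def external_schema_sql(schema, database, role_arn, region):
    """Render CREATE EXTERNAL SCHEMA; inputs are trusted constants, not user input."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        f"FROM DATA CATALOG DATABASE '{database}'\n"
        f"IAM_ROLE '{role_arn}'\n"
        f"REGION '{region}';"
    )


policy_json = json.dumps(redshift_trust_policy(), indent=2)
sql = external_schema_sql(
    "archived_trips",
    "sampledb",
    "arn:aws:iam::123456789012:role/MySpectrumRole",
    "us-east-2",
)
print(policy_json)
print(sql)
```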
Recommendations – 1 of 4
Use Amazon Redshift Spectrum to improve scan-intensive concurrent workloads
Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster.
It pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Amazon Redshift Spectrum layer, so queries use much less of your cluster’s processing capacity.
Recommendations – 2 of 4
Optimize your Data Lake for queries - Use Apache Parquet
Pay special attention to two metrics in the SVL_S3QUERY_SUMMARY system view:
• s3_scanned_rows
• s3query_returned_rows
Compared with CSV files, you will notice a tremendous reduction in the amount of data returned from Amazon Redshift Spectrum to Amazon Redshift for final processing.
Recommendations – 3 of 4
Optimize your Data Lake for queries - Partition Parquet files
Use the following SQL to analyze the effectiveness of partition pruning. If the query touches only a few partitions, you can verify that everything behaves as expected:
SELECT query, segment,
       max(assigned_partitions) AS total_partitions,
       max(qualified_partitions) AS qualified_partitions
FROM svl_s3partition
WHERE query=<Query-ID>
GROUP BY 1,2;
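To interpret the output of that query, a small Python helper can turn the two partition counts into a pruning ratio. The rows below are made-up values in the shape (query, segment, total_partitions, qualified_partitions), and `pruning_ratio` is a hypothetical name; a ratio close to 1.0 would suggest that most partitions were skipped.

```python
def pruning_ratio(total_partitions, qualified_partitions):
    """Fraction of partitions skipped by pruning (0.0 = none, 1.0 = all)."""
    if total_partitions <= 0:
        raise ValueError("total_partitions must be positive")
    return (total_partitions - qualified_partitions) / total_partitions


# Hypothetical result rows: (query, segment, total_partitions, qualified_partitions)
rows = [
    (1234, 1, 100, 25),
    (1234, 2, 100, 100),
]

for query, segment, total, qualified in rows:
    ratio = pruning_ratio(total, qualified)
    print(f"query={query} segment={segment} pruned={ratio:.0%}")
```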
Recommendations – 4 of 4
Improve Data Lake query performance with predicate pushdown
Certain SQL operations can be pushed down to the Amazon Redshift Spectrum layer. Take advantage of these wherever possible. For example:
• GROUP BY clauses and some string functions
• Equal predicates and pattern-matching conditions such as LIKE
• Common aggregate functions such as COUNT, SUM, AVG, MIN, MAX, and many others
• Functions such as REGEXP_REPLACE and many others
Certain SQL operations like DISTINCT and ORDER BY must be performed in Amazon Redshift because they can’t be pushed down to Amazon Redshift Spectrum. If possible, minimize or avoid using them.
NOTE: Replace DISTINCT with GROUP BY
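The DISTINCT-to-GROUP BY rewrite in the note can be checked for equivalence with a small Python sketch. SQLite stands in for the query engine purely to show that both forms return the same rows; it says nothing about pushdown behavior in Redshift Spectrum, and the `trips` table and its values are invented for the example.

```python
import sqlite3

# In-memory SQLite database with a small, made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (vendor TEXT, passengers INTEGER)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("A", 1), ("A", 1), ("B", 2), ("A", 2), ("B", 2)],
)

# Original form: DISTINCT over the selected columns.
distinct_rows = conn.execute(
    "SELECT DISTINCT vendor, passengers FROM trips"
).fetchall()

# Rewritten form: GROUP BY the same columns.
grouped_rows = conn.execute(
    "SELECT vendor, passengers FROM trips GROUP BY vendor, passengers"
).fetchall()

# Sort client-side so the comparison does not depend on row order.
distinct_rows = sorted(distinct_rows)
grouped_rows = sorted(grouped_rows)
print(distinct_rows == grouped_rows)
conn.close()
```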
Data Lakes on AWS
Snowball
Snowmobile
Kinesis Data Firehose
Kinesis Data Streams
Amazon S3
AWS Glue
Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility - https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf
Redshift
EMR
Athena
Kinesis
Elasticsearch Service
Thank you!

 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

AWS Data Lake: data analysis @ scale

  • 1. Data Lake: data analysis @ scale. Giorgio Nobile, Solutions Architect. April 24, 2018. © 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. Data Lake Challenges: data variety and data volumes are increasing rapidly; multiple consumers and applications; ingest, discover, catalog, understand, and curate all kinds of data; quickly drive new insights.
  • 4. Purpose-built engines. Right tool for the right job. Customer needs come first.
  • 5. Amazon's Purpose-Built Analytics Offerings (Collect, Store, Analyze): Amazon Kinesis Firehose, AWS Direct Connect, Amazon Snowball, Amazon Kinesis Analytics, Amazon Kinesis Streams, Amazon S3, Amazon Glacier, Amazon CloudSearch, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon Elasticsearch, Amazon EMR, Amazon Redshift, Amazon QuickSight, AWS Database Migration Service, AWS Glue, Amazon Athena, Amazon AI.
  • 6. Traditional Data Warehouse: relational data; terabytes to petabytes scale; schema defined prior to data load; operational reporting and ad hoc analysis. Diagram labels: OLTP, ERP, CRM, LOB, Data Warehouse, Business Intelligence.
  • 7. Data Lakes extend traditional warehouses: relational and non-relational data; terabytes to exabytes scale; schema defined during analysis (schema on read); diverse analytical engines to gain insights; designed for low-cost storage and analytics. Diagram labels: OLTP, ERP, CRM, LOB, Data Warehouse, Business Intelligence, Data Lake, Devices, Web, Sensors, Social, Data Catalog, Machine Learning, DW Queries, Big Data Processing, Interactive, Real-time.
  • 8. Data Lakes on AWS: wide variety of ways to bring data in; durability and availability at exabyte scale; security, compliance, and audit capabilities; run any analytics on the same data without movement; scale storage and compute independently; store at $0.023 / GB-month; query for $0.05 / GB scanned. Services shown: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Amazon S3, AWS Glue, Redshift, EMR, Athena, Kinesis, Elasticsearch Service.
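The two prices on slide 8 can be turned into a rough cost estimate. The sketch below uses only the figures quoted on the slide (2018 values, which may no longer be current); the workload sizes and function names are hypothetical, chosen for illustration.

```python
from decimal import Decimal

# Prices as quoted on the slide (2018 figures; illustrative, not current pricing).
STORAGE_USD_PER_GB_MONTH = Decimal("0.023")
QUERY_USD_PER_GB_SCANNED = Decimal("0.05")


def monthly_storage_cost(stored_gb):
    """Cost in USD of keeping `stored_gb` gigabytes in the lake for one month."""
    return STORAGE_USD_PER_GB_MONTH * Decimal(stored_gb)


def query_cost(scanned_gb):
    """Cost in USD of a query that scans `scanned_gb` gigabytes."""
    return QUERY_USD_PER_GB_SCANNED * Decimal(scanned_gb)


# Hypothetical workload: 1,000 GB stored, one query scanning 100 GB.
print("storage per month (USD):", monthly_storage_cost(1000))
print("single query (USD):", query_cost(100))
```

Because the query price depends on bytes scanned rather than bytes stored, anything that reduces the scan (columnar formats, partitioning) lowers the per-query figure directly.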
  • 9. Data Lake Architecture. Data Ingestion (get your data into S3 quickly and securely): Snowball, Database Migration Service, Kinesis Firehose, Direct Connect. Central Storage (secure, cost-effective storage in Amazon S3): S3. Catalog & Search (access and search metadata): DynamoDB, Elasticsearch. Access & User Interface (give your users easy and secure access): API Gateway, Identity & Access Management, Cognito. Processing & Analytics (use of predictive and prescriptive analytics to gain better understanding): QuickSight, Amazon AI, EMR, Redshift, Athena, Kinesis, RDS. Protect and Secure (use entitlements to ensure data is secure and users' identities are verified): Security Token Service, CloudWatch, CloudTrail, Key Management Service. Glue ETL.
  • 10. Serverless Analytics: deliver cost-effective analytic solutions faster. Serverless means zero infrastructure and zero administration; pay only for what you use, not for idle resources; availability and fault tolerance built in; automatically scales resources with usage. Diagram labels: Snowball, Snowmobile, Kinesis Data Firehose, many other sources; S3 Data Lake; Glue (Data Catalog and ETL); Amazon Athena; Amazon EMR; Redshift Spectrum; QuickSight; other BI tools.
  • 11. Storing is not enough, data needs to be discoverable. "Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing)." (Gartner) Data sources shown: CRM, ERP, data warehouse, mainframe data, web, social, log files, machine data, semi-structured, unstructured.
  • 12. AWS Glue: Serverless Data Catalog & ETL. Data Catalog (discover data and extract schema): automatically discovers data and stores schema; data is immediately searchable and available for ETL. ETL (auto-generate customizable code in Python and Spark): generates customizable code; schedules and runs your ETL jobs; serverless.
  • 13. Crawlers: Automatic Schema Inference. Stages shown: enumerate S3 objects (file 1, file 2, ... file N); identify file type and parse files, using custom classifiers (Grok-based parser) and built-in classifiers (JSON parser, CSV parser, Parquet parser, ...); semi-structured per-file schema; semi-structured unified schema.
  • 14. What can Crawlers Classify? A Glue Crawler with an IAM role reaches data lakes (Amazon S3) through an object connection, and data warehouses (Amazon Redshift) and databases (Amazon RDS) through a JDBC connection. Built-in classifiers: MySQL, MariaDB, PostgreSQL, Aurora, Redshift, Avro, Parquet, ORC, XML, JSON & BSON, logs (Apache (Grok), Linux (Grok), MS (Grok), Ruby, Redis, and many others), delimited (comma, pipe, tab, semicolon), compressions (ZIP, BZIP, GZIP, LZ4, Snappy).
  • 15. Detecting Schema Similarity. Schema A: root { name: str, id: num, addr { street: str, city: str, zip: num } }. Schema B: root { name: str, id: num, addr: str }. Schema similarity heuristic: 1 point for matching name; 1 point for matching data type; match when similarity index > 0.7. sim = intersection / min(A, B) = 7 / 8 = 0.875.
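The arithmetic on slide 15 can be reproduced with a small sketch. This is one reading of the heuristic, not Glue's actual implementation: it assumes every node (including `root` and nested structs) is worth 2 points (name plus type), that the intersection counts points shared by both schemas, and that the denominator is the smaller schema's total. The function names and the dict encoding of schemas are hypothetical.

```python
def flatten(schema, path="root"):
    """Map each field path to a type name; nested dicts are typed as 'struct'."""
    if isinstance(schema, dict):
        fields = {path: "struct"}
        for name, child in schema.items():
            fields.update(flatten(child, f"{path}.{name}"))
        return fields
    return {path: schema}


def similarity(schema_a, schema_b):
    """Shared points divided by the point total of the smaller schema."""
    fields_a, fields_b = flatten(schema_a), flatten(schema_b)
    shared = 0
    for path, type_name in fields_a.items():
        if path in fields_b:
            shared += 1                      # 1 point for matching name
            if fields_b[path] == type_name:
                shared += 1                  # 1 point for matching data type
    smaller_total = 2 * min(len(fields_a), len(fields_b))
    return shared / smaller_total


# The two schemas from the slide.
schema_a = {"name": "str", "id": "num",
            "addr": {"street": "str", "city": "str", "zip": "num"}}
schema_b = {"name": "str", "id": "num", "addr": "str"}

sim = similarity(schema_a, schema_b)
print(sim, "match" if sim > 0.7 else "no match")
```

Under these assumptions `root`, `name`, and `id` contribute 2 points each and `addr` contributes 1 (same name, different type), giving 7 of the 8 points available in Schema B.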
  • 16. Custom Classifiers: 1. Write a custom classifier. 2. Add it to your crawler.
  • 17. Automatically Detect Partitions. Available partitions.
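The slide does not say how partitions are recognized. A common convention is the Hive-style `key=value` folder layout in S3 object keys; the sketch below assumes that layout and shows how partition values could be read from a key. The object keys and the helper name are hypothetical, and this is not a description of the crawler's internals.

```python
def partition_values(s3_key):
    """Return {partition_key: value} for each 'key=value' folder in an object key."""
    values = {}
    for segment in s3_key.split("/")[:-1]:   # the last segment is the file name
        if "=" in segment:
            key, value = segment.split("=", 1)
            values[key] = value
    return values


# Hypothetical object keys under a table prefix.
keys = [
    "trips/year=2018/month=03/part-0000.parquet",
    "trips/year=2018/month=04/part-0000.parquet",
    "trips/year=2018/month=04/part-0001.parquet",
]

# Distinct partitions found across all objects.
available = sorted({tuple(partition_values(k).items()) for k in keys})
for partition in available:
    print(dict(partition))
```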
  • 18. Automatic Schema Versioning: automatically update table version as data evolves.
  • 19. Other Ways of Creating Tables: call Glue's CreateTable API; create table manually; run Hive DDL statement; import from Apache Hive Metastore. Diagram labels: Apache Hive Metastore, AWS Glue ETL, AWS Glue Data Catalog.
  • 20. Amazon Redshift: Data Warehousing. Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost; massively parallel, scale from gigabytes to petabytes. Fast at any scale: columnar storage technology to improve I/O efficiency and scale query performance. Inexpensive: as low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour. Open file formats: analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3. Secure: audit everything; encrypt data end-to-end; extensive certification and compliance.
  • 21. New Dense Compute Node: DC2. 2x performance at the same price as DC1; 3x more I/O with 30% better storage utilization than DC1. NVMe SSD, DDR4 memory, Intel E5-2686 v4 (Broadwell). "We saw a 9x reduction in month-end reporting time with Redshift DC2 nodes as compared to DC1" (Bradley Todd, Technical Architect, Liberty Mutual).
  • 22. Amazon Redshift Spectrum: extend the data warehouse to exabytes of data in an S3 data lake. Exabyte Redshift SQL queries against S3; join data across Redshift and S3; scale compute and storage separately; stable query performance and unlimited concurrency; CSV, ORC, Grok, Avro, & Parquet data formats; pay only for the amount of data scanned. Diagram labels: S3 data lake, Redshift data, Redshift Spectrum query engine.
  • 23. Amazon Redshift Spectrum Architecture: massively parallel, shared-nothing columnar architecture. Leader node: SQL endpoint; stores metadata; coordinates parallel SQL processing. Compute nodes: local, columnar storage; execute queries in parallel; load, unload, backup, restore. Amazon Redshift Spectrum nodes: execute queries directly against Amazon S3. Diagram labels: SQL clients/BI tools, JDBC/ODBC, leader node, compute nodes (nodes shown with 128 GB RAM, 16 TB disk, 16 cores), Amazon Redshift Spectrum nodes 1, 2, 3, 4, ... N, Amazon S3.
  • 24. Redshift Spectrum: query your data lake. Amazon Redshift (JDBC/ODBC); Redshift Spectrum nodes 1, 2, 3, 4, ... N (scale-out serverless compute); Amazon S3 (exabyte-scale object storage); AWS Glue Data Catalog. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY …
  • 25. Defining Data Lake Artifacts (Schema on Read). Define an external schema in Amazon Redshift using the Data Catalog: CREATE EXTERNAL SCHEMA archived_trips FROM DATA CATALOG DATABASE 'sampledb' IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole' REGION 'us-east-2'; View external schemas: SELECT * FROM svv_external_schemas; View external tables: SELECT * FROM svv_external_tables;
  • 26. Data Lake on Amazon S3 with AWS Glue. Diagram labels: on-premises data, web app data, Amazon RDS, other databases, streaming data, your data, AWS Glue ETL, Amazon QuickSight.
  • 27. NUVIAD: Data Lake Analytics with Redshift Spectrum. NUVIAD is a marketing platform that helps media buyers optimize their mobile bidding. Use AWS for marketing campaign and bidding analytics; scale S3 storage for unlimited data capacity; use Spectrum for unlimited scale and query concurrency; 80% performance gain using Parquet data format. "Amazon Redshift Spectrum is a game changer for us. Reports that took minutes to produce are now delivered in seconds. We like the ability to scale compute on-demand to query petabytes of data in S3 in various open file formats." (Rafi Ton, CEO, NUVIAD) Diagram labels: data sources, Amazon S3, AWS Glue, Amazon Redshift, Redshift Spectrum, BI tools.
  • 28. Demonstration
  • 29. Redshift IAM Callout. Amazon Redshift needs authorization to access the Data Catalog in AWS Glue and the data files in Amazon S3 on your behalf. To provide that authorization, you first create an AWS Identity and Access Management (IAM) role. Then you attach the role to your cluster and provide the Amazon Resource Name (ARN) for the role in the Amazon Redshift CREATE EXTERNAL SCHEMA statement.
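The steps in the callout can be sketched as two artifacts: a trust policy that lets the Redshift service assume the role, and the CREATE EXTERNAL SCHEMA statement that names the role's ARN. The sketch below only builds these as data and text; it makes no AWS calls. The helper names are hypothetical, the schema, database, ARN, and region are the sample values from slide 25, and the permissions the role also needs (read access to S3 and the Glue Data Catalog) are not shown.

```python
import json


def redshift_trust_policy():
    """Trust policy allowing the Amazon Redshift service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "redshift.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_external_schema_sql(schema, database, role_arn, region):
    """Build the statement text; plain string formatting, for trusted input only."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema} "
        f"FROM DATA CATALOG DATABASE '{database}' "
        f"IAM_ROLE '{role_arn}' REGION '{region}';"
    )


# Sample values taken from slide 25.
statement = create_external_schema_sql(
    "archived_trips",
    "sampledb",
    "arn:aws:iam::123456789012:role/MySpectrumRole",
    "us-east-2",
)
print(json.dumps(redshift_trust_policy(), indent=2))
print(statement)
```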
  • 30. Recommendations – 1 of 4. Use Amazon Redshift Spectrum to improve scan-intensive concurrent workloads. Amazon Redshift Spectrum resides on dedicated Amazon Redshift servers that are independent of your cluster. It pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Amazon Redshift Spectrum layer, so queries use much less of your cluster's processing capacity.
  • 31. Recommendations – 2 of 4. Query-optimize your data lake: use Apache Parquet. Pay special attention to two metrics: s3_scanned_rows and s3query_returned_rows. Compared with CSV files, you will notice a large reduction in the amount of data returned from Amazon Redshift Spectrum to Amazon Redshift for final processing.
  • 32. Recommendations – 3 of 4. Query-optimize your data lake: partition Parquet files. Use the following SQL to analyze the effectiveness of partition pruning. If the query touches only a few partitions, you can verify that everything behaves as expected: SELECT query, segment, max(assigned_partitions) AS total_partitions, max(qualified_partitions) AS qualified_partitions FROM svl_s3partition WHERE query=<Query-ID> GROUP BY 1,2;
  • 33. Recommendations – 4 of 4. Improve data lake query performance with predicate pushdown. Certain SQL operations can be pushed down to the Amazon Redshift Spectrum layer, and you want to take advantage of these wherever possible. For example: GROUP BY clauses and some string functions; equal predicates and pattern-matching conditions such as LIKE; common aggregate functions such as COUNT, SUM, AVG, MIN, MAX, and many others; functions such as regex_replace and many others. Certain SQL operations like DISTINCT and ORDER BY must be performed in Amazon Redshift because they can't be pushed down to Amazon Redshift Spectrum. If possible, you should minimize or avoid using them. NOTE: Replace DISTINCT with GROUP BY.
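The note about replacing DISTINCT with GROUP BY relies on the two forms returning the same rows. The sketch below checks that equivalence using Python's built-in sqlite3 module as a stand-in engine; it does not demonstrate pushdown itself, which is specific to Redshift Spectrum, and the table and data are hypothetical.

```python
import sqlite3

# In-memory stand-in table; names and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Rome", 10.0), ("Milan", 12.5), ("Rome", 8.0), ("Turin", 9.0), ("Milan", 7.5)],
)

# Form the slide suggests avoiding against external tables.
distinct_rows = conn.execute("SELECT DISTINCT city FROM trips").fetchall()

# Equivalent rewrite using GROUP BY, which the slide lists as pushdown-eligible.
grouped_rows = conn.execute("SELECT city FROM trips GROUP BY city").fetchall()

# Row order is not guaranteed by either query, so compare after sorting locally.
print(sorted(distinct_rows) == sorted(grouped_rows))
conn.close()
```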
  • 34. Data Lakes on AWS. Building Big Data Storage Solutions (Data Lakes) for Maximum Flexibility - https://d1.awsstatic.com/whitepapers/Storage/data-lake-on-aws.pdf Services shown: Snowball, Snowmobile, Kinesis Data Firehose, Kinesis Data Streams, Amazon S3, AWS Glue, Redshift, EMR, Athena, Kinesis, Elasticsearch Service.
  • 35. Thank you!