SlideShare uma empresa Scribd logo
1 de 43
Baixar para ler offline
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
How to Build a Data Lake with AWS
Glue Data Catalog
P r a j a k t a D a m l e
S e n i o r P r o d u c t M a n a g e r , A W S G l u e
A B D 2 1 3
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What to expect from this session
Data challenge today
What is a data lake?
What is AWS Glue Data Catalog?
How does AWS Glue catalogue my data?
My data is catalogued, what’s next?
Q&A
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Your data today
Documents and files Records Streams
Amazon
RDS
Amazon
DynamoDB
AWS IoT
On Premises
databases
Amazon Kinesis
Streams
Spreadsheets Infrastructure logs
Clickstream data Mobile app data
Social media data Amazon
Redshift
Device data Amazon Kinesis
Firehose
Sensor data
ERP
Multiple sources and formats… and growing everyday
Why is this a new problem?
Web and mobile
data
Logs
Social Media data
Streaming data IOT data
Spreadsheets
Structured data
Unstructured and Semi-structured data
Dark data
Data Volume
The Data Gap
1990 2000 2010 2020
Generated Data
Available for Analysis
Dark data challenge
Multiple consumers and requirements
Data duplication
Data Scientists
Analysts
Business Users
Applications
Agile Real time
Flexible Scale
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is a data lake?
Collect and store all data, at any
scale, and low cost
Help locate, curate, and secure your
data
Provide democratized access to data
within your organization
Quickly and easily perform new
types of data analysis
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Benefits of a data lake
Quickly ingest and store
any type of data, at any
scale, and at low cost
Have a single source of
truth and quickly search
and find the relevant data
Easily query the data
through a unified set of
tools
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Layers of a data lake
Athena
Lex
Amazon
Rekognition
Amazon
Polly Amazon ML
Store
Secure
Analyze
AI
AMAZON QUICKSIGHT
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
The missing piece
> A unified view into your data no matter where it is stored
Integration with your analytics tools
A way to automatically build your metadata and keep it in
sync with your data as it evolves
>
>
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Automatically discover and categorize your data making it immediately
searchable and queryable across data sources.
Generate code to clean, enrich, and reliably move data between various
data sources. Easily customize this code or bring your own.
Run your jobs on a serverless, fully managed, scale-out environment. No
compute resources to provision or manage.
Discover
Develop
Deploy
What is AWS Glue?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Select AWS Glue customers
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Job AuthoringData Catalog Job Execution
Apache Hive Metastore compatible
Integrated with AWS services
Automatic crawling
Discover
Auto-generates ETL code
Python and Apache
Spark
Edit, debug, and share
Develop
Serverless execution
Flexible scheduling
Monitoring and alerting
Deploy
AWS Glue Components
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is the AWS Glue Data Catalog?
Unified metadata repository across relational databases, Amazon RDS,
Amazon Redshift, and Amazon S3…with support for more coming soon!
• Get a single view into your data, no matter where it is stored
• Automatically classify your data in one central list that is searchable
• Track data evolution using schema versioning
• Query your data using Amazon Athena or Amazon Redshift Spectrum
• Apache Hive metastore compatible; can be used as an external
metastore for applications running on Amazon EMR
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data lake on Amazon S3 with AWS Glue
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
AMAZON
QUICKSIGHT
AWS GLUE ETL
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Logical data lake with AWS Glue
Unified view
AWS GLUE ETL
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How do I set up the Glue Data Catalog?
Call Glue’s CreateTable API
Create table manually Run Hive DDL statement
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Easier way to build the Glue Data Catalog
1. Tell us where your data is
2. Tell us how often you want to check for updates
And you are done! Your Data Catalog is ready for search and querying
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What are crawlers?
Crawlers automatically build your Data Catalog and keep it in sync
• Scan your data stored in various data stores, extract metadata and data statistics,
and add table definitions to your Data Catalog
• Classify data using built-in and custom classifiers
• You can write your own using Grok expressions
• Discover new data, extracts schema definitions
• Detect schema changes and version tables
• Detect Hive style partitions on Amazon S3
• Run on demand or on a schedule; serverless – only pay when crawler runs
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
GitHub timeline data
20+ event types
githubarchive.org
unique payload
per event type
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Table schema
Table properties
Data statistics
Nested fields
A table in the Glue Data Catalog
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How is my data classified?
Crawlers apply a set of classifiers to the data as they scan it and add the metadata as
Tables to the Data Catalog.
A classifier recognizes the format of your data and generates a schema.
It returns a certainty number between 0.0 and 1.0, which helps crawlers determine if
there is a match.
Glue has a list of in-build classifiers that are applied with every crawl. But you can
write your own!
You can set up your crawler with an ordered set of classifiers. Crawlers invoke
classifiers in the order they were provided until a match is found.
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Crawlers: automatic schema inference
semi-structured
per-file schema
semi-structured
unified schema
identify file type
and parse files
enumerate
S3 objects
file 1
file 2
file N
…
int
array
intchar
struct
char int
array
struct
char
bool int
int
arrayint
char
char int
custom classifiers
Grok based parser
built-in classifiers
JSON parser
CSV parser
Parquet parser
…
bool
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Detecting schema similarity
name:
str
id: num
Schema A
root
addr
street: str city: str zip: num
name:
str
id: num
Schema B
root
addr: str
Schema similarity heuristic
 1 point for matching name
 1 point for matching data type
 Match when similarity index > 0.7
intersection
min(A,B)
7
8
.875sim
IAM Role
Glue Crawler
Data Lakes
Data Warehouse
Databases
Amazon
RDS
Amazon
Redshift
Amazon S3
JDBC Connection
Object Connection
Built-In Classifiers
MySQL
MariaDB
PostreSQL
Oracle
Microsoft SQL Server
Amazon Aurora
Amazon Redshift
Avro
Parquet
ORC
XML
JSON & BSON
Logs
(Apache (Grok), Linux(Grok), MS(Grok), Ruby, Redis,
and many others)
Delimited
(comma, pipe, tab, semicolon)
Compressions
(ZIP, BZIP, GZIP, LZ4, Snappy)
What can crawlers classify? Create additional Custom
Classifiers with Grok!
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How can I write my own classifiers?
You can write a custom classifier by providing a
Grok pattern and a classification string for the
matched schema.
A Grok pattern is a named set of regular
expressions (regex) that are used to match data
one line at a time.
Example:
%{TIMESTAMP_ISO8601:timestamp}
[%{MESSAGEPREFIX:message_prefix}]
%{CRAWLERLOGLEVEL:loglevel} :
%{GREEDYDATA:message}
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Custom classifiers
1. Write a custom classifier 2. Add it to your crawler
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Available partitions
Automatically detected partitions
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
How are partitions detected?
Estimate schema similarity among files at each level to
handle semi-structured logs, schema evolution…
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Automatically update table version as data evolves
Automatic schema versioning
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Import/Export your metadata
Apache Hive
Metastore
Apache Hive
Metastore
Import from an external metastore Export to an external metastore
Find the import/export ETL script on Glue’s GitHub repository
AWS GLUE ETL
AWS GLUE ETL
AWS GLUE
DATA CATALOG
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Your data is catalogued…what’s next?
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Quickly find your data
Search on key terms Save results and come back to it later
Query data in Amazon Athena
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Analyze same data with different engines
AMAZON
QUICKSIGHT
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is Amazon Athena?
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
$
SQL
Query Instantly
Zero setup cost; just
point to Amazon S3
and start querying
Pay per query
Pay only for queries run;
save 30-90% on per-query
costs through
compression
Open
ANSI SQL interface,
JDBC/ODBC drivers, multiple
formats, compression types,
and complex joins and data
types
Easy
Serverless: zero
infrastructure, zero
administration
Integrated with Amazon
QuickSight
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is Amazon EMR?
Analytics and ML at scale with 19 open-source projects
Integration with AWS Glue Data Catalog for Apache Spark, Apache Hive, and Presto
Enterprise-grade security
$
Latest versions
Updated with the latest
open source frameworks
within 30 days of release
Low cost
Flexible billing with per-
second billing, EC2 spot,
reserved instances and
auto-scaling to reduce
costs 50-80%
Use S3 storage
Process data directly in
the Amazon S3 data lake
securely with high
performance using the
EMRFS connector
Easy
Launch fully managed
Apache Hadoop & Apache
Spark in minutes; no cluster
setup, node provisioning,
cluster tuning
Data Lake
100110000100101011100
1010101110010101000
00111100101100101
010001100001
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is Amazon Redshift Spectrum?
E x t e n d t h e d a t a w a r e h o u s e t o y o u r S 3 d a t a l a k e
S3 data lakeAmazon
Redshift data
Amazon Redshift Spectrum
query engine
Exabyte Amazon Redshift SQL queries against S3
Join data across Amazon Redshift and S3
Scale compute and storage separately
Stable query performance and unlimited concurrency
Parquet, ORC, Grok, Avro, & CSV data formats
Pay only for the amount of data scanned
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Serverless data exploration
Crawlers AWS GLUE DATA
CATALOG
Data
Unified view
Data explorer
>
Gain insight in minutes without
the need to configure and
operationalize infrastructure
Data scientists want fast
access to disparate datasets for
data exploration
>
>
Glue automatically catalogues
heterogeneous data sources, and
offers serverless Apache Spark
infrastructure for interactive analysis
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Move data across storage systems
Unified view
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Data lake vs. data warehouse
Data lake Data warehouse
Semi-structured /Unstructured
/structured data
Structured data
Schema on read Schema on write
Data science, predictive analysis, BI
use cases
SQL based BI use cases
Great for storing granular data; raw as
well as processed data
Great for storing frequently accessed
data as well as data aggregates and
summary
Separation of compute and storage Tightly coupled compute and storage
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Interoperate data lake and data warehouse
On premises data
Web app data
Amazon RDS
Other databases
Streaming data
Your data
AWS GLUE ETL
AMAZON
QUICKSIGHT
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Key announcements (coming soon)
>
>
>
Write Glue ETL jobs in Scala, in addition to PySpark
Glue available in eu-west-1 (Ireland)
Glue available in ap-northeast-1(Tokyo)
© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
THANK YOU!
G l u e - p m @ a m a z o n . c o m
h t t p s : / / a w s . a m a z o n . c o m / g l u e / d e v e l o p e r - r e s o u r c e s /

Mais conteúdo relacionado

Mais procurados

Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Cathrine Wilhelmsen
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Amazon Web Services
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on AzureTrivadis
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogDATAVERSITY
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueAmazon Web Services
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSAmazon Web Services
 
Hadoop Migration to databricks cloud project plan.pptx
Hadoop Migration to databricks cloud project plan.pptxHadoop Migration to databricks cloud project plan.pptx
Hadoop Migration to databricks cloud project plan.pptxyashodhannn
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudAmazon Web Services
 
Everything You Need to Know About Big Data: From Architectural Principles to ...
Everything You Need to Know About Big Data: From Architectural Principles to ...Everything You Need to Know About Big Data: From Architectural Principles to ...
Everything You Need to Know About Big Data: From Architectural Principles to ...Amazon Web Services
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
ADV Slides: Strategies for Fitting a Data Lake into a Modern Data Architecture
ADV Slides: Strategies for Fitting a Data Lake into a Modern Data ArchitectureADV Slides: Strategies for Fitting a Data Lake into a Modern Data Architecture
ADV Slides: Strategies for Fitting a Data Lake into a Modern Data ArchitectureDATAVERSITY
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSAmazon Web Services
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 

Mais procurados (20)

Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
Pipelines and Data Flows: Introduction to Data Integration in Azure Synapse A...
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)Deep Dive - Amazon Elastic MapReduce (EMR)
Deep Dive - Amazon Elastic MapReduce (EMR)
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Activate Data Governance Using the Data Catalog
Activate Data Governance Using the Data CatalogActivate Data Governance Using the Data Catalog
Activate Data Governance Using the Data Catalog
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
 
Implementing a Data Lake
Implementing a Data LakeImplementing a Data Lake
Implementing a Data Lake
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Big Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWSBig Data Architectural Patterns and Best Practices on AWS
Big Data Architectural Patterns and Best Practices on AWS
 
Hadoop Migration to databricks cloud project plan.pptx
Hadoop Migration to databricks cloud project plan.pptxHadoop Migration to databricks cloud project plan.pptx
Hadoop Migration to databricks cloud project plan.pptx
 
Building Data Lakes in the AWS Cloud
Building Data Lakes in the AWS CloudBuilding Data Lakes in the AWS Cloud
Building Data Lakes in the AWS Cloud
 
Everything You Need to Know About Big Data: From Architectural Principles to ...
Everything You Need to Know About Big Data: From Architectural Principles to ...Everything You Need to Know About Big Data: From Architectural Principles to ...
Everything You Need to Know About Big Data: From Architectural Principles to ...
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
ADV Slides: Strategies for Fitting a Data Lake into a Modern Data Architecture
ADV Slides: Strategies for Fitting a Data Lake into a Modern Data ArchitectureADV Slides: Strategies for Fitting a Data Lake into a Modern Data Architecture
ADV Slides: Strategies for Fitting a Data Lake into a Modern Data Architecture
 
Best Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWSBest Practices for Building Your Data Lake on AWS
Best Practices for Building Your Data Lake on AWS
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 

Semelhante a How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017

Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Amazon Web Services
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAmazon Web Services
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...Amazon Web Services
 
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018Amazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSAmazon Web Services
 
Best Practices for Distributed Machine Learning and Predictive Analytics Usin...
Best Practices for Distributed Machine Learning and Predictive Analytics Usin...Best Practices for Distributed Machine Learning and Predictive Analytics Usin...
Best Practices for Distributed Machine Learning and Predictive Analytics Usin...Amazon Web Services
 
21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with ZopaAmazon Web Services
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesAmazon Web Services
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesBuild Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesAmazon Web Services
 
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansSTG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansAmazon Web Services
 
Deploying Business Analytics at Enterprise Scale - AWS Online Tech Talks
Deploying Business Analytics at Enterprise Scale - AWS Online Tech TalksDeploying Business Analytics at Enterprise Scale - AWS Online Tech Talks
Deploying Business Analytics at Enterprise Scale - AWS Online Tech TalksAmazon Web Services
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfAmazon Web Services
 
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...Amazon Web Services
 
Building a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudBuilding a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudAmazon Web Services
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAmazon Web Services
 
ABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For EnterpriseABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For EnterpriseAmazon Web Services
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Amazon Web Services
 

Semelhante a How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017 (20)

Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
Best Practices for Building a Data Lake in Amazon S3 and Amazon Glacier, with...
 
AWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scaleAWS Data Lake: data analysis @ scale
AWS Data Lake: data analysis @ scale
 
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
ABD318_Architecting a data lake with Amazon S3, Amazon Kinesis, AWS Glue and ...
 
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018
Build an ETL Pipeline to Analyze Customer Data (AIM416) - AWS re:Invent 2018
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Building Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWSBuilding Data Lakes and Analytics on AWS
Building Data Lakes and Analytics on AWS
 
Best Practices for Distributed Machine Learning and Predictive Analytics Usin...
Best Practices for Distributed Machine Learning and Predictive Analytics Usin...Best Practices for Distributed Machine Learning and Predictive Analytics Usin...
Best Practices for Distributed Machine Learning and Predictive Analytics Usin...
 
Construindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWSConstruindo data lakes e analytics com AWS
Construindo data lakes e analytics com AWS
 
21st Century Analytics with Zopa
21st Century Analytics with Zopa21st Century Analytics with Zopa
21st Century Analytics with Zopa
 
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best PracticesBuild Data Lakes & Analytics on AWS: Patterns & Best Practices
Build Data Lakes & Analytics on AWS: Patterns & Best Practices
 
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best PracticesBuild Data Lakes and Analytics on AWS: Patterns & Best Practices
Build Data Lakes and Analytics on AWS: Patterns & Best Practices
 
STG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data OceansSTG206_Big Data Data Lakes and Data Oceans
STG206_Big Data Data Lakes and Data Oceans
 
Deploying Business Analytics at Enterprise Scale - AWS Online Tech Talks
Deploying Business Analytics at Enterprise Scale - AWS Online Tech TalksDeploying Business Analytics at Enterprise Scale - AWS Online Tech Talks
Deploying Business Analytics at Enterprise Scale - AWS Online Tech Talks
 
Implementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdfImplementazione di una soluzione Data Lake.pdf
Implementazione di una soluzione Data Lake.pdf
 
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
RET301-Build Single Customer View across Multiple Retail Channels using AWS S...
 
Building a Modern Data Platform in the Cloud
Building a Modern Data Platform in the CloudBuilding a Modern Data Platform in the Cloud
Building a Modern Data Platform in the Cloud
 
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech TalksAnalyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
Analyze your Data Lake, Fast @ Any Scale - AWS Online Tech Talks
 
ABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For EnterpriseABD311_Deploying Amazon QuickSight For Enterprise
ABD311_Deploying Amazon QuickSight For Enterprise
 
Building Data Lakes with AWS
Building Data Lakes with AWSBuilding Data Lakes with AWS
Building Data Lakes with AWS
 
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
Building Data Lakes and Analytics on AWS; Patterns and Best Practices - BDA30...
 

Mais de Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mais de Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Último

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 

Último (20)

Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 

How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT How to Build a Data Lake with AWS Glue Data Catalog P r a j a k t a D a m l e S e n i o r P r o d u c t M a n a g e r , A W S G l u e A B D 2 1 3
  • 2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What to expect from this session Data challenge today What is a data lake? What is AWS Glue Data Catalog? How does AWS Glue catalogue my data? My data is catalogued, what’s next? Q&A
  • 3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Your data today Documents and files Records Streams Amazon RDS Amazon DynamoDB AWS IoT On Premises databases Amazon Kinesis Streams Spreadsheets Infrastructure logs Clickstream data Mobile app data Social media data Amazon Redshift Device data Amazon Kinesis Firehose Sensor data ERP Multiple sources and formats… and growing everyday
  • 4. Why is this a new problem? Web and mobile data Logs Social Media data Streaming data IOT data Spreadsheets Structured data Unstructured and Semi-structured data Dark data
  • 5. Data Volume The Data Gap 1990 2000 2010 2020 Generated Data Available for Analysis Dark data challenge
  • 6. Multiple consumers and requirements Data duplication Data Scientists Analysts Business Users Applications Agile Real time Flexible Scale
  • 7. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is a data lake? Collect and store all data, at any scale, and low cost Help locate, curate, and secure your data Provide democratized access to data within your organization Quickly and easily perform new types of data analysis
  • 8. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of a data lake Quickly ingest and store any type of data, at any scale, and at low cost Have a single source of truth and quickly search and find the relevant data Easily query the data through a unified set of tools
  • 9. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Layers of a data lake Athena Lex Amazon Rekognition Amazon Polly Amazon ML Store Secure Analyze AI AMAZON QUICKSIGHT
  • 10. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The missing piece > A unified view into your data no matter where it is stored Integration with your analytics tools A way to automatically build your metadata and keep it in sync with your data as it evolves > >
  • 11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Automatically discover and categorize your data making it immediately searchable and queryable across data sources. Generate code to clean, enrich, and reliably move data between various data sources. Easily customize this code or bring your own. Run your jobs on a serverless, fully managed, scale-out environment. No compute resources to provision or manage. Discover Develop Deploy What is AWS Glue?
  • 12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Select AWS Glue customers
  • 13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Job AuthoringData Catalog Job Execution Apache Hive Metastore compatible Integrated with AWS services Automatic crawling Discover Auto-generates ETL code Python and Apache Spark Edit, debug, and share Develop Serverless execution Flexible scheduling Monitoring and alerting Deploy AWS Glue Components
  • 14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is the AWS Glue Data Catalog? Unified metadata repository across relational databases, Amazon RDS, Amazon Redshift, and Amazon S3…with support for more coming soon! • Get a single view into your data, no matter where it is stored • Automatically classify your data in one central list that is searchable • Track data evolution using schema versioning • Query your data using Amazon Athena or Amazon Redshift Spectrum • Apache Hive metastore compatible; can be used as an external metastore for applications running on Amazon EMR
  • 15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data lake on Amazon S3 with AWS Glue On premises data Web app data Amazon RDS Other databases Streaming data Your data AMAZON QUICKSIGHT AWS GLUE ETL
  • 16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Logical data lake with AWS Glue Unified view AWS GLUE ETL
  • 17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How do I set up the Glue Data Catalog? Call Glue’s CreateTable API Create table manually Run Hive DDL statement
  • 18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Easier way to build the Glue Data Catalog 1. Tell us where your data is 2. Tell us how often you want to check for updates And you are done! Your Data Catalog is ready for search and querying
  • 19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What are crawlers? Crawlers automatically build your Data Catalog and keep it in sync • Scan your data stored in various data stores, extract metadata and data statistics, and add table definitions to your Data Catalog • Classify data using built-in and custom classifiers • You can write your own using Grok expressions • Discover new data, extracts schema definitions • Detect schema changes and version tables • Detect Hive style partitions on Amazon S3 • Run on demand or on a schedule; serverless – only pay when crawler runs
  • 20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. GitHub timeline data 20+ event types githubarchive.org unique payload per event type
  • 21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Table schema Table properties Data statistics Nested fields A table in the Glue Data Catalog
  • 22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How is my data classified? Crawlers apply a set of classifiers to the data as they scan it and add the metadata as Tables to the Data Catalog. A classifier recognizes the format of your data and generates a schema. It returns a certainty number between 0.0 and 1.0, which helps crawlers determine if there is a match. Glue has a list of in-build classifiers that are applied with every crawl. But you can write your own! You can set up your crawler with an ordered set of classifiers. Crawlers invoke classifiers in the order they were provided until a match is found.
  • 23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Crawlers: automatic schema inference semi-structured per-file schema semi-structured unified schema identify file type and parse files enumerate S3 objects file 1 file 2 file N … int array intchar struct char int array struct char bool int int arrayint char char int custom classifiers Grok based parser built-in classifiers JSON parser CSV parser Parquet parser … bool
  • 24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Detecting schema similarity name: str id: num Schema A root addr street: str city: str zip: num name: str id: num Schema B root addr: str Schema similarity heuristic  1 point for matching name  1 point for matching data type  Match when similarity index > 0.7 intersection min(A,B) 7 8 .875sim
  • 25. IAM Role Glue Crawler Data Lakes Data Warehouse Databases Amazon RDS Amazon Redshift Amazon S3 JDBC Connection Object Connection Built-In Classifiers MySQL MariaDB PostreSQL Oracle Microsoft SQL Server Amazon Aurora Amazon Redshift Avro Parquet ORC XML JSON & BSON Logs (Apache (Grok), Linux(Grok), MS(Grok), Ruby, Redis, and many others) Delimited (comma, pipe, tab, semicolon) Compressions (ZIP, BZIP, GZIP, LZ4, Snappy) What can crawlers classify? Create additional Custom Classifiers with Grok!
  • 26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How can I write my own classifiers? You can write a custom classifier by providing a Grok pattern and a classification string for the matched schema. A Grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. Example: %{TIMESTAMP_ISO8601:timestamp} [%{MESSAGEPREFIX:message_prefix}] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
  • 27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Custom classifiers 1. Write a custom classifier 2. Add it to your crawler
  • 28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Available partitions Automatically detected partitions
  • 29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How are partitions detected? Estimate schema similarity among files at each level to handle semi-structured logs, schema evolution…
  • 30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Automatically update table version as data evolves Automatic schema versioning
  • 31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Import/Export your metadata Apache Hive Metastore Apache Hive Metastore Import from an external metastore Export to an external metastore Find the import/export ETL script on Glue’s GitHub repository AWS GLUE ETL AWS GLUE ETL AWS GLUE DATA CATALOG
  • 32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Your data is catalogued…what’s next?
  • 33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Quickly find your data Search on key terms Save results and come back to it later Query data in Amazon Athena
  • 34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Analyze same data with different engines AMAZON QUICKSIGHT
  • 35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is Amazon Athena? Interactive query service to analyze data in Amazon S3 using standard SQL No infrastructure to set up or manage and no data to load $ SQL Query Instantly Zero setup cost; just point to Amazon S3 and start querying Pay per query Pay only for queries run; save 30-90% on per-query costs through compression Open ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types Easy Serverless: zero infrastructure, zero administration Integrated with Amazon QuickSight
  • 36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is Amazon EMR? Analytics and ML at scale with 19 open-source projects Integration with AWS Glue Data Catalog for Apache Spark, Apache Hive, and Presto Enterprise-grade security $ Latest versions Updated with the latest open source frameworks within 30 days of release Low cost Flexible billing with per- second billing, EC2 spot, reserved instances and auto-scaling to reduce costs 50-80% Use S3 storage Process data directly in the Amazon S3 data lake securely with high performance using the EMRFS connector Easy Launch fully managed Apache Hadoop & Apache Spark in minutes; no cluster setup, node provisioning, cluster tuning Data Lake 100110000100101011100 1010101110010101000 00111100101100101 010001100001
  • 37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is Amazon Redshift Spectrum? E x t e n d t h e d a t a w a r e h o u s e t o y o u r S 3 d a t a l a k e S3 data lakeAmazon Redshift data Amazon Redshift Spectrum query engine Exabyte Amazon Redshift SQL queries against S3 Join data across Amazon Redshift and S3 Scale compute and storage separately Stable query performance and unlimited concurrency Parquet, ORC, Grok, Avro, & CSV data formats Pay only for the amount of data scanned
  • 38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Serverless data exploration Crawlers AWS GLUE DATA CATALOG Data Unified view Data explorer > Gain insight in minutes without the need to configure and operationalize infrastructure Data scientists want fast access to disparate datasets for data exploration > > Glue automatically catalogues heterogeneous data sources, and offers serverless Apache Spark infrastructure for interactive analysis
  • 39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Move data across storage systems Unified view
  • 40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data lake vs. data warehouse Data lake Data warehouse Semi-structured /Unstructured /structured data Structured data Schema on read Schema on write Data science, predictive analysis, BI use cases SQL based BI use cases Great for storing granular data; raw as well as processed data Great for storing frequently accessed data as well as data aggregates and summary Separation of compute and storage Tightly coupled compute and storage
  • 41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Interoperate data lake and data warehouse On premises data Web app data Amazon RDS Other databases Streaming data Your data AWS GLUE ETL AMAZON QUICKSIGHT
  • 42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Key announcements (coming soon) > > > Write Glue ETL jobs in Scala, in addition to PySpark Glue available in eu-west-1 (Ireland) Glue available in ap-northeast-1(Tokyo)
  • 43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. THANK YOU! G l u e - p m @ a m a z o n . c o m h t t p s : / / a w s . a m a z o n . c o m / g l u e / d e v e l o p e r - r e s o u r c e s /