© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Shafreen Sayyed
Solutions Architect, Amazon Web Services
Using Data Lakes to quench your
Analytics fire
Forces and Trends Prompting the Move to Cloud
Cost Optimization
Licenses
Hardware
Data center and operations
Dark Data
Prematurely discarding data
Agility
Experimentation (data & tools)
Democratised Access to Data
Time-to-first-results
Terminate failed experiments early
From BI to Data Science
In-house data science
From back office to product
Storage is the Gravity for Cloud Applications
Store all your data, for ever, at every stage of its lifecycle
Apply it using the right tool for the job
Where do I start?
Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers and insights
Start here
(with a business case)
Storage is Job #1
Object Storage is Foundational
Data storage
Amazon S3
Amazon DynamoDB
Amazon Elasticsearch Service
Amazon RDS
Versioning
Lifecycle Management
5 TB Objects
Designed for 99.999999999% Durability
Replication
Reliability
Security
Scalability
Events and Lifecycle Management
Standard: active data
Standard - Infrequent Access: infrequently accessed data
Amazon Glacier: archive data
Create
Delete
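The tiering on this slide (Standard, then Standard - Infrequent Access, then Amazon Glacier) can be written down as an S3 lifecycle configuration. The sketch below only builds the configuration document; the `raw/` prefix, the rule ID, and the 30/90-day thresholds are illustrative assumptions, not values from the talk.

```python
import json


def build_lifecycle_rules(prefix, ia_after_days=30, glacier_after_days=90,
                          expire_after_days=None):
    """Return a lifecycle configuration dict in the shape S3 expects.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the Standard-IA transition")
    rule = {
        "ID": f"tier-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            # Move to infrequent access first, then to archive storage.
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }
    if expire_after_days is not None:
        # Optional final step: delete the object after this many days.
        rule["Expiration"] = {"Days": expire_after_days}
    return {"Rules": [rule]}


config = build_lifecycle_rules("raw/")
print(json.dumps(config, indent=2))
```

With boto3 installed and credentials configured, a dict like this could be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`; that call is left out here so the sketch runs without an AWS account.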
S3 as the Data Lake Fabric
• Unlimited number of objects
and volume
• 99.99% availability
• 99.999999999% durability
• Versioning
• Tiered storage via lifecycle
policies
• SSL, client/server-side
encryption at rest
• Low cost (just over
$2700/month for 100TB)
• Natively supported by big
data frameworks (Spark, Hive,
Presto, etc)
• Decouples storage and
compute
• Run transient compute
clusters (with Amazon EC2
Spot Instances)
• Multiple, heterogeneous
clusters can use same data
Database Migration
Service
Automated Data Ingestion
Stream Events to S3 Using Kinesis Firehose
Write Database Changes to S3 with DMS
Full Load:
<schema_name>/<table_name>/LOAD001.csv
<schema_name>/<table_name>/LOAD002.csv
Change Data Capture:
<schema_name>/<table_name>/<time-stamp>.csv
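A downstream job often needs to tell full-load files from change-data-capture files. The sketch below classifies keys using only the layout shown on this slide; the `LOAD<digits>.csv` pattern and the example schema and table names are assumptions taken from or modelled on the slide, so real DMS output may differ in detail.

```python
import re

# Assumption: full-load files are named LOAD<digits>.csv, as on the slide.
FULL_LOAD = re.compile(r"^LOAD\d+\.csv$")


def classify_dms_key(key):
    """Split '<schema>/<table>/<file>.csv' and label the file's role."""
    parts = key.split("/")
    if len(parts) != 3 or not parts[2].endswith(".csv"):
        raise ValueError(f"unexpected key layout: {key!r}")
    schema, table, filename = parts
    # Anything that is not a full-load file is treated as a CDC file.
    kind = "full_load" if FULL_LOAD.match(filename) else "cdc"
    return schema, table, kind


# Hypothetical keys for illustration only.
for key in ("sales/orders/LOAD001.csv", "sales/orders/20180712-101500123.csv"):
    print(key, "->", classify_dms_key(key))
```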
Data ingestion
Amazon Kinesis
• Fully-managed real-time stream processing
• Highly available across multiple AZs
• Can capture and store:
  • Terabytes of data per hour
  • From hundreds of thousands of sources
AWS IoT
• Collect data from your connected devices
• Communicate securely back to your devices
• Can easily support:
  • Billions of devices
  • Trillions of messages
“If you knew the state of every thing in the world, and could
reason on top of that data, what problems could you solve?”
Data collection
AWS Direct Connect
• Dedicated 1 Gbps and 10 Gbps fibre link to AWS
• Low cost, with consistent low latency/jitter
• Direct access to AWS services and your VPCs
AWS Snowball
• Tamper-resistant case and electronics
• Ruggedized case that can withstand 8.5 G
• Available in 50 TB or 80 TB capacities
AWS Database Migration Service
• Modernise, migrate, or replicate your RDBMS
• Fan-in multiple sources to single target
• Platform and schema conversion
Scalable (secure, versioned, durable) storage +
Immutable data at every stage of its lifecycle +
Versioned schema and metadata
=
Data discovery, lineage
Storage + Catalog
AWS Glue
Data Catalog
Discover and store metadata
Job Execution
Serverless scheduling and execution
Job Authoring
Auto-generated ETL code
Glue Data Catalog
Hive metastore-compatible, highly-available metadata repository:
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve
• Table definitions – usable by Redshift, Athena, Glue, EMR
Populate using Hive DDL, bulk import, or automatically through crawlers.
Crawlers: Automatic Schema Inference
[Diagram: a crawler enumerates S3 objects (file 1, file 2, … file N), identifies each file type and parses the files with custom classifiers (app log parser, metrics parser, …) and system classifiers (JSON parser, CSV parser, Apache log parser, …), yielding a semi-structured per-file schema (fields typed as int, char, bool, array, struct) and then a semi-structured unified schema.]
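The step from per-file schemas to a unified schema can be illustrated with a toy merge. This is not how AWS Glue crawlers are implemented; the type names mirror the labels on the slide, and the rule of falling back to `char` when two files disagree is an assumption made for this sketch.

```python
def infer_type(value):
    """Map a parsed JSON value to one of the slide's type labels."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, str):
        return "char"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "char"  # assumption: anything else is kept as text


def infer_schema(record):
    """Per-file schema: field name -> type label."""
    return {name: infer_type(value) for name, value in record.items()}


def unify(schemas):
    """Union of all fields; conflicting types fall back to 'char' (assumed rule)."""
    unified = {}
    for schema in schemas:
        for name, type_label in schema.items():
            if name not in unified:
                unified[name] = type_label
            elif unified[name] != type_label:
                unified[name] = "char"
    return unified


# Hypothetical records standing in for file 1 and file 2.
file_1 = {"id": 1, "name": "a"}
file_2 = {"id": 2, "tags": ["x"], "active": True}
print(unify([infer_schema(file_1), infer_schema(file_2)]))
```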
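Indexing and Searching Using Metadata
[Diagram: Amazon S3 emits ObjectCreated and ObjectDeleted events to AWS Lambda, which calls PutItem on the metadata index (Amazon DynamoDB); the DynamoDB update stream feeds a second AWS Lambda step that extracts search fields and updates the search index (Amazon Elasticsearch).]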
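The first hop of this pipeline, turning S3 event notifications into metadata-index operations, might look like the sketch below. It only parses the event and returns the intended actions, so it runs without AWS access; the item attribute names are hypothetical. Note that S3 reports deletions with event names beginning `ObjectRemoved`, which is what the slide labels ObjectDeleted.

```python
from urllib.parse import unquote_plus


def metadata_actions(event):
    """Translate an S3 event notification into (action, item) pairs."""
    actions = []
    for record in event.get("Records", []):
        event_name = record["eventName"]
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if event_name.startswith("ObjectCreated"):
            size = record["s3"]["object"].get("size", 0)
            actions.append(("put", {"bucket": bucket, "key": key, "size": size}))
        elif event_name.startswith("ObjectRemoved"):
            actions.append(("delete", {"bucket": bucket, "key": key}))
    return actions


# Hypothetical event with one created object.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "logs/2018/app+log.json", "size": 1024},
            },
        }
    ]
}
print(metadata_actions(sample_event))
```

In a real Lambda handler, each `put` would become a DynamoDB `PutItem` call and each `delete` a `DeleteItem` call.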
Data processing and analysis
Amazon Redshift: structured data processing
• Petabyte scale data warehouse
• Fault-tolerant scalable cluster with node auto-recovery
• Auto backup into Amazon S3
Amazon EMR: semi/unstructured data processing
• Fully-managed big data platform
• Auto-scaling clusters
• Supports Hadoop:
  • Hive, Spark, Presto
  • Zeppelin, HBase, Flink
• HDFS and Amazon S3 filesystems
Amazon Athena: serverless query processing
• No infrastructure to manage
• No data loading required
• Supports multiple data formats:
  • CSV, TSV, Avro, ORC, Parquet
• Uses ANSI SQL to directly query Amazon S3
Consume and visualise data
AWS Lambda: compute platforms
• No infrastructure to manage
• Event-driven processing
• Pay per 100 ms CPU
• Node.js, Python, Java and C# (.NET Core)
Amazon Machine Learning: machine learning
• No infrastructure to manage
• Multiple classifier types
• Interactive UI for modelling and dataset visualisation
Amazon QuickSight: business intelligence
• No infrastructure to manage
• Fast, cloud-powered BI tool
• Scales to hundreds of thousands of users
• Quick calculations with SPICE
Amazon Redshift
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
A lot faster, a lot simpler, a lot cheaper
Amazon Redshift has security built in
SSL to secure data in transit
Encryption to secure data at rest
• AES-256; hardware accelerated
• All blocks on disks and in Amazon S3 encrypted
• HSM support
No direct access to compute nodes
Audit logging, AWS CloudTrail, AWS KMS integration
Amazon VPC support
SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
[Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node in the customer VPC; the leader node and three compute nodes (each node shown with 128 GB RAM, 16 TB disk, 16 cores) sit on a 10 GigE (HPC) network in an internal VPC; ingestion, backup, and restore run against Amazon S3/Amazon DynamoDB.]
Redshift Spectrum
• Leverages Amazon Redshift’s advanced cost-based optimizer
• Pushes down projections, filters, aggregations and join reduction
• Dynamic partition pruning to minimize data processed
• Automatic parallelization of query execution against Amazon S3 data
• Efficient join processing within the Amazon Redshift cluster
[Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC to the Redshift leader node and compute nodes (Redshift nodes; each shown with 128 GB RAM, 16 TB disk, 16 cores; 10 GigE (HPC); ingestion, backup, restore); the cluster fans out to Spectrum nodes 1, 2, 3, 4 … N, which read from Amazon S3 (exabyte-scale object storage) using the Data Catalog / Apache Hive Metastore.]
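Partition pruning is the easiest of these optimisations to picture: if object keys encode partition values, a query filter can rule out whole prefixes before any data is read. The sketch below is a toy illustration, not Redshift Spectrum's implementation; the Hive-style `name=value` key layout and the example keys are assumptions.

```python
def parse_partitions(key):
    """Extract Hive-style name=value path segments from an object key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # last segment is the file name
        if "=" in segment:
            name, value = segment.split("=", 1)
            partitions[name] = value
    return partitions


def prune(keys, **wanted):
    """Keep only keys whose partition values match every requested filter."""
    return [
        key for key in keys
        if all(parse_partitions(key).get(name) == value
               for name, value in wanted.items())
    ]


# Hypothetical object keys for illustration.
keys = [
    "events/year=2018/month=06/a.parquet",
    "events/year=2018/month=07/b.parquet",
    "events/year=2017/month=07/c.parquet",
]
print(prune(keys, year="2018", month="07"))
```

Only one of the three objects survives the filter, so a query restricted to July 2018 would scan a third of the data in this example.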
Amazon Redshift works with third-party analysis tools
JDBC/ODBC
Amazon Redshift
Amazon
QuickSight
New!
Security is Job #0
Data Access & Authorisation
Give your users easy and secure access
Storage & Catalog
Secure, cost-effective storage
in Amazon S3. Robust
metadata in AWS Catalog
Protect and Secure
Use entitlements to ensure data is secure
and users’ identities are verified
Identity and Access Management
• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with AWS Security Token Service (AWS STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
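As an illustration of the policy language and of S3 bucket policies, the sketch below builds a read-only policy for one prefix of a data lake bucket. The bucket name, prefix, account ID, and role name are all hypothetical placeholders.

```python
import json


def read_only_policy(bucket, prefix, role_arn):
    """Build a bucket policy granting one role read access to one prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action, limited here by prefix.
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                # Reading is an object-level action on keys under the prefix.
                "Sid": "ReadObjects",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = read_only_policy(
    bucket="example-data-lake",                            # hypothetical
    prefix="curated/",                                     # hypothetical
    role_arn="arn:aws:iam::111122223333:role/analyst",     # hypothetical
)
print(json.dumps(policy, indent=2))
```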
Security at the Data Level
IAM governs service API access to: Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, Amazon Athena
Storage Level Support for Access Logging and Audit
Third Party Ecosystem Security Tools
Amazon S3
AWS CloudTrail
http://amzn.to/2tSimHj
Amazon Athena
Access Logging
API Logging
Access Log Analytics
IAM
Amazon EMR
http://amzn.to/2si6RqS
Encryption Options
AWS Server-Side encryption
• AWS managed key infrastructure
AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services
AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2
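For the first two options, the choice shows up as request parameters when an object is written to Amazon S3. The helper below is a minimal sketch that only assembles those parameters; the function name and the key alias are hypothetical.

```python
def encryption_args(kms_key_id=None):
    """Return server-side encryption parameters for an S3 upload.

    Without a key ID, ask for S3-managed keys; with one, ask for AWS KMS.
    """
    if kms_key_id is None:
        return {"ServerSideEncryption": "AES256"}
    return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}


print(encryption_args())
print(encryption_args("alias/example-data-lake-key"))  # hypothetical key alias
```

With boto3, these could be spread into an upload call, for example `s3.put_object(Bucket=..., Key=..., Body=..., **encryption_args(...))`.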
Serverless Processing and Analytics
• Python code generated
by AWS Glue
• Connect a notebook or
IDE to AWS Glue
• Existing code brought
into AWS Glue
Managed ETL with AWS Glue
• Schedule-based
• Event-based
• On demand
Job Execution with AWS Glue
Amazon Athena – Analyze Data in S3
• Interactive queries
• ANSI SQL
• No infrastructure or administration
• Zero spin up time
• Query data in its raw format
• AVRO, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the
best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3 and take
advantage of Amazon S3 durability and availability
Simple query editor
with syntax highlighting
and autocomplete
Data Catalog
Query History, Saved Queries, and
Catalog Management
QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises
sources including Amazon Athena
Amazon RDS
Amazon S3
Amazon Redshift
Amazon Athena
Using Amazon Athena with Amazon QuickSight
Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing
applications that process data for real-time
visualizations and alarms
SELECT STREAM author,
count(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#BigDataCapeTown%'
WINDOW ONE_MINUTE AS
(PARTITION BY author
RANGE INTERVAL '1' MINUTE PRECEDING);
Amazon Kinesis Analytics – Simple SQL Interface
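To make the windowed query concrete, the sketch below reproduces its logic in plain Python: for each matching tweet, it emits the author together with the number of that author's matching tweets in the preceding minute. It is an offline illustration, not Kinesis Analytics code; the tuple layout of the input and the sample tweets are assumptions.

```python
from collections import defaultdict, deque


def rolling_author_counts(tweets, hashtag, window_seconds=60):
    """Per-author sliding-window counts over (timestamp, author, text) tuples.

    Assumes tweets arrive in timestamp order, as they would on a stream.
    """
    recent = defaultdict(deque)  # author -> timestamps still inside the window
    results = []
    for timestamp, author, text in tweets:
        if hashtag not in text:  # the WHERE ... LIKE filter
            continue
        times = recent[author]   # the PARTITION BY author
        times.append(timestamp)
        # Drop timestamps older than the window (RANGE ... PRECEDING).
        while times[0] < timestamp - window_seconds:
            times.popleft()
        results.append((author, len(times)))
    return results


# Hypothetical sample stream.
tweets = [
    (0, "ann", "#BigDataCapeTown hi"),
    (30, "ann", "more #BigDataCapeTown"),
    (45, "bob", "#BigDataCapeTown"),
    (50, "ann", "no tag here"),
    (100, "ann", "#BigDataCapeTown again"),
]
print(rolling_author_counts(tweets, "#BigDataCapeTown"))
```

The last tweet from `ann` counts as 1 rather than 3 because her earlier tweets have fallen out of the one-minute window by then.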
Again, where do I start?
Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers and insights
Seriously, start here
(with a business case)
Then collect your data
[Diagram labels: Security & Governance (IAM, Amazon CloudWatch, AWS CloudTrail, AWS KMS, AWS CloudHSM, AWS Directory Service, Amazon Cognito); Data Catalog (Amazon Athena, Catalog, RDS, Hive Metastore, EMR, RDS, Glue Catalog)]
Summary
Build decoupled systems
• Use Amazon S3 as the data fabric of your data lake
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable log, batch, interactive & real-time views
Be cost-conscious
• Big data ≠ big cost
Thank you!

Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018

  • 1. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Shafreen Sayyed Solutions Architect, Amazon Web Services Using Data Lakes to quench your Analytics fire
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Forces and Trends Prompting the Move to Cloud Cost Optimization Licenses Hardware Data center and operations Dark Data Prematurely discarding data Agility Experimentation (data & tools) Democratised Access to Data Time-to-first-results Terminate failed experiments early From BI to Data Science In-house data science From back office to product
  • 3. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Storage is the Gravity for Cloud Applications Store all your data, for ever, at every stage of its lifecycle Apply it using the right tool for the job
  • 4. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Where do I start? Ingest / Collect Consume/ visualize Store Process/ analyze Data 1 4 0 9 5 Answers and insights
  • 5. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Where do I start? Ingest / Collect Consume/ visualize Store Process/ analyze Data 1 4 0 9 5 Answers and insights Start here (with a business case)
  • 6. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Storage is Job #1
  • 7. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Object Storage is Foundational
  • 8. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data storage Amazon S3 Amazon DynamoDB Amazon Elasticsearch Service Amazon RDS Versioning Lifecycle Management 5 TB Objects Designed for 99.999999999% Durability Replication Reliability Security Scalability
  • 9. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Standard Active data Archive dataInfrequently accessed data Standard - Infrequent Access Amazon Glacier Create Delete Events and Lifecycle Management
  • 10. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. S3 as the Data Lake Fabric • Unlimited number of objects and volume • 99.99% availability • 99.999999999% durability • Versioning • Tiered storage via lifecycle policies • SSL, client/server-side encryption at rest • Low cost (just over $2700/month for 100TB) • Natively supported by big data frameworks (Spark, Hive, Presto, etc) • Decouples storage and compute • Run transient compute clusters (with Amazon EC2 Spot Instances) • Multiple, heterogeneous clusters can use same data
  • 11. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Database Migration Service Automated Data Ingestion
  • 12. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Stream Events to S3 Using Kinesis Firehose
  • 13. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Write Database Changes to S3 with DMS <schema_name>/<table_name>/LOAD001.csv <schema_name>/<table_name>/LOAD002.csv <schema_name>/<table_name>/<time-stamp>.csv Full Load Change Data Capture
  • 14. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data ingestion Amazon Kinesis AWS IoT • Fully-managed real-time stream processing • Highly available across multiple AZs • Can capture and store: • Terabytes of data per hour • From hundreds of thousands sources • Collect data from your connected devices • Communicate securely back to your devices • Can easily support: • Billions of devices • Trillions of messages “If you knew the state of every thing in the world, and could reason on top of that data, what problems could you solve?”
  • 15. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Data collection • Dedicated 1 Gbps and 10 Gbps fibre link to AWS • Low cost, with consistent low latency/jitter • Direct access to AWS services and your VPCs • Tamper-resistant case and electronics • Ruggedized case that can withstand 8.5 G • Available in 50 TB or 80 TB capacities AWS Snowball AWS Database Migration Service • Modernise, migrate, or replicate your RDBMS • Fan-in multiple sources to single target • Platform and schema conversion AWS Direct Connect
  • 16. Storage + Catalog: Scalable (secure, versioned, durable) storage + immutable data at every stage of its lifecycle + versioned schema and metadata = data discovery, lineage
  • 17. AWS Glue • Data Catalog: discover and store metadata • Job Execution: serverless scheduling and execution • Job Authoring: auto-generated ETL code
  • 18. Glue Data Catalog: Hive metastore-compatible, highly available metadata repository • Classification for identifying and parsing files • Versioning of table metadata as schemas evolve • Table definitions usable by Redshift, Athena, Glue, EMR • Populate using Hive DDL, bulk import, or automatically through crawlers
  • 19. Crawlers: Automatic Schema Inference
      • Enumerate S3 objects (file 1, file 2, … file N)
      • Identify file type and parse files using custom classifiers (app log parser, metrics parser, …) and system classifiers (JSON parser, CSV parser, Apache log parser, …)
      • Produce a semi-structured per-file schema, then merge into a semi-structured unified schema
      • [Diagram: per-file schema trees (int, char, bool, array, struct fields) merged into one unified schema]
  • 20. Indexing and Searching Using Metadata
      • Amazon S3 emits ObjectCreated and ObjectDeleted events to AWS Lambda
      • Lambda writes to the Metadata Index (Amazon DynamoDB) with PutItem
      • The DynamoDB update stream triggers Lambda to extract search fields and update the Search Index (Amazon Elasticsearch)
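The first hop of this indexing flow can be sketched as a pure function that turns an S3 event notification into metadata-index operations. The function name, the item fields, and the sample event are assumptions for illustration; the actual DynamoDB writes are left out so the sketch runs without AWS access. Note that in S3 event records the deletion event names start with `ObjectRemoved`, which corresponds to the "ObjectDeleted" label on the slide.

```python
from urllib.parse import unquote_plus


def index_actions(event):
    """Translate an S3 event notification into metadata-index operations.

    Returns a list of dictionaries: {"op": "put", "item": {...}} for created
    objects and {"op": "delete", "key": {...}} for removed ones.
    """
    actions = []
    for record in event.get("Records", []):
        event_name = record.get("eventName", "")
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if event_name.startswith("ObjectCreated"):
            actions.append({
                "op": "put",
                "item": {
                    "bucket": bucket,
                    "key": key,
                    "size": record["s3"]["object"].get("size", 0),
                    "event_time": record.get("eventTime"),
                },
            })
        elif event_name.startswith("ObjectRemoved"):
            actions.append({"op": "delete", "key": {"bucket": bucket, "key": key}})
    return actions


sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "eventTime": "2018-03-01T10:15:00.000Z",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "raw/my+file%281%29.csv", "size": 1024},
            },
        },
        {
            "eventName": "ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "example-data-lake"},
                "object": {"key": "raw/old.csv"},
            },
        },
    ]
}
for action in index_actions(sample_event):
    print(action)
```

Inside a Lambda handler, each "put" action would map to a DynamoDB `PutItem` call and each "delete" action to a `DeleteItem` call against the metadata table.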
  • 21. Data processing and analysis
      • Amazon Redshift (structured data processing): Petabyte-scale data warehouse; fault-tolerant scalable cluster with node auto-recovery; auto backup into Amazon S3
      • Amazon EMR (semi/unstructured data processing): Fully managed big data platform; auto-scaling clusters; supports Hadoop (Hive, Spark, Presto, Zeppelin, HBase, Flink); HDFS and Amazon S3 filesystems
      • Amazon Athena (serverless query processing): No infrastructure to manage; no data loading required; supports multiple data formats (CSV, TSV, Avro, ORC, Parquet); uses ANSI SQL to directly query Amazon S3
  • 22. Consume and visualise data
      • AWS Lambda (compute platforms): No infrastructure to manage; event-driven processing; pay per 100 ms CPU; Node.js, Python, Java and C# (.NET Core)
      • Amazon Machine Learning (machine learning): No infrastructure to manage; multiple classifier types; interactive UI for modelling and dataset visualisation
      • Amazon QuickSight (business intelligence): No infrastructure to manage; fast, cloud-powered BI tool; scales to hundreds of thousands of users; quick calculations with SPICE
  • 23. Amazon Redshift: a lot faster, a lot simpler, a lot cheaper • Relational data warehouse • Massively parallel; petabyte scale • Fully managed • HDD and SSD platforms • $1,000/TB/year; starts at $0.25/hour
  • 24. Amazon Redshift has security built in
      • SSL to secure data in transit
      • Encryption to secure data at rest: AES-256, hardware accelerated; all blocks on disks and in Amazon S3 encrypted; HSM support
      • No direct access to compute nodes
      • Audit logging, AWS CloudTrail, AWS KMS integration
      • Amazon VPC support
      • SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
      • [Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC from the customer VPC to the leader node; compute nodes (each labelled 128 GB RAM, 16 TB disk, 16 cores) run in an internal VPC on 10 GigE (HPC) networking; ingestion, backup, and restore go through Amazon S3/Amazon DynamoDB]
  • 25. Redshift Spectrum
      • Leverages Amazon Redshift’s advanced cost-based optimizer
      • Pushes down projections, filters, aggregations and join reduction
      • Dynamic partition pruning to minimize data processed
      • Automatic parallelization of query execution against Amazon S3 data
      • Efficient join processing within the Amazon Redshift cluster
      • [Architecture diagram: SQL clients/BI tools connect over JDBC/ODBC to the leader node; Redshift compute nodes (each labelled 128 GB RAM, 16 TB disk, 16 cores) fan out to Spectrum nodes 1 … N, which read from Amazon S3 (exabyte-scale object storage) using the Data Catalog or an Apache Hive Metastore]
  • 27. Amazon Redshift works with third-party analysis tools over JDBC/ODBC, and with Amazon QuickSight (new)
  • 28. Security is Job #0
  • 29. Data Access & Authorisation: Give your users easy and secure access • Storage & Catalog: Secure, cost-effective storage in Amazon S3; robust metadata in AWS Catalog • Protect and Secure: Use entitlements to ensure data is secure and users’ identities are verified
  • 30. Identity and Access Management • Manage users, groups, and roles • Identity federation with OpenID • Temporary credentials with AWS Security Token Service (AWS STS) • Stored policy templates • Powerful policy language • Amazon S3 bucket policies
  • 31. Security at the Data Level: IAM governs service API access to Amazon S3, Amazon ElastiCache, Amazon DynamoDB, Amazon EMR, Amazon Kinesis, and Amazon Athena
  • 32. Storage Level Support for Access Logging and Audit • Access logging: Amazon S3 • API logging: AWS CloudTrail • Access log analytics: Amazon Athena, Amazon EMR • IAM • Third-party ecosystem security tools • Links: http://amzn.to/2tSimHj, http://amzn.to/2si6RqS
  • 33. Encryption Options
      • AWS server-side encryption: AWS managed key infrastructure
      • AWS Key Management Service: Automated key rotation & auditing; integration with other AWS services
      • AWS CloudHSM: Dedicated-tenancy SafeNet Luna SA HSM device; Common Criteria EAL4+, NIST FIPS 140-2
  • 34. Serverless Processing and Analytics
  • 35. Managed ETL with AWS Glue • Python code generated by AWS Glue • Connect a notebook or IDE to AWS Glue • Existing code brought into AWS Glue
  • 36. Job Execution with AWS Glue • Schedule-based • Event-based • On demand
  • 37. Amazon Athena – Analyze Data in S3 • Interactive queries • ANSI SQL • No infrastructure or administration • Zero spin-up time • Query data in its raw format: Avro, Text, CSV, JSON, weblogs, AWS service logs • Convert to an optimized form like ORC or Parquet for the best performance and lowest cost • No loading of data, no ETL required • Stream data directly from Amazon S3, taking advantage of Amazon S3 durability and availability
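The advice to convert raw data into Parquet can be sketched with Athena's `CREATE TABLE AS SELECT` statement, submitted through the `start_query_execution` API. The helper below only assembles the request; the function name, database, table names, and S3 locations are placeholders assumed for illustration, and identifiers are interpolated without escaping, so they must come from trusted input.

```python
def parquet_ctas_request(database, source_table, target_table,
                         target_location, results_location):
    """Build start_query_execution arguments for a Parquet conversion.

    The query copies `source_table` into a new Parquet-backed table stored
    under `target_location`.
    """
    query = (
        f"CREATE TABLE {target_table} "
        f"WITH (format = 'PARQUET', external_location = '{target_location}') "
        f"AS SELECT * FROM {source_table}"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": results_location},
    }


request = parquet_ctas_request(
    database="example_db",
    source_table="weblogs_raw",
    target_table="weblogs_parquet",
    target_location="s3://example-data-lake/curated/weblogs/",
    results_location="s3://example-data-lake/athena-results/",
)
print(request["QueryString"])

# With boto3 installed and credentials configured, the query could be run as:
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```

Because Athena charges by data scanned, querying the columnar copy is usually cheaper than querying the raw text files, which is the point the slide makes.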
  • 38. Simple query editor with syntax highlighting and autocomplete • Data Catalog • Query History, Saved Queries, and Catalog Management
  • 39. Using Amazon Athena with Amazon QuickSight: QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources, including Amazon Athena, Amazon RDS, Amazon S3, and Amazon Redshift
  • 41. Amazon Kinesis Analytics • Interact with streaming data in real time using SQL • Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
  • 42. Amazon Kinesis Analytics – Simple SQL Interface: SELECT STREAM author, COUNT(author) OVER ONE_MINUTE FROM Tweets WHERE text LIKE '%#BigDataCapeTown%' WINDOW ONE_MINUTE AS (PARTITION BY author RANGE INTERVAL '1' MINUTE PRECEDING);
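To make the semantics of the windowed query concrete, here is a minimal Python simulation of the same idea: for every matching tweet, emit the author together with the number of matching tweets from that author in the preceding minute. It is a sketch, not how Kinesis Analytics is implemented; the function name and the `(timestamp_seconds, author, text)` tuple shape are assumptions, timestamps are assumed to arrive in non-decreasing order, and the window boundary is treated as inclusive.

```python
from collections import defaultdict, deque


def sliding_author_counts(tweets, hashtag, window_seconds=60):
    """Yield (author, count) for each tweet whose text contains `hashtag`.

    `count` is the number of matching tweets by the same author within the
    last `window_seconds`, including the current one.
    """
    recent = defaultdict(deque)  # author -> timestamps still inside the window
    for timestamp, author, text in tweets:
        if hashtag not in text:  # Case-sensitive, like SQL LIKE.
            continue
        window = recent[author]
        window.append(timestamp)
        # Evict timestamps that fell out of the window.
        while window[0] < timestamp - window_seconds:
            window.popleft()
        yield author, len(window)


tweets = [
    (0, "ann", "Hello #BigDataCapeTown"),
    (10, "bob", "no tag here"),
    (30, "ann", "#BigDataCapeTown again"),
    (45, "bob", "#BigDataCapeTown"),
    (100, "ann", "still at #BigDataCapeTown"),
]
for author, count in sliding_author_counts(tweets, "#BigDataCapeTown"):
    print(author, count)
```

The per-author deque plays the role of `PARTITION BY author`, and the eviction loop plays the role of `RANGE INTERVAL '1' MINUTE PRECEDING`.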
  • 43. Again, where do I start? Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers and insights • Seriously, start here (with a business case) • Then collect your data
  • 44. Security & Governance: IAM, Amazon CloudWatch, AWS CloudTrail, AWS KMS, AWS CloudHSM, AWS Directory Service, Amazon Cognito • Data Catalog: Amazon Athena Catalog, Hive Metastore (EMR, RDS), Glue Catalog
  • 45. Summary
  • 48. Build decoupled systems (use Amazon S3 as the data fabric of your data lake; Data → Store → Process → Store → Analyze → Answers) • Use the right tool for the job (data structure, latency, throughput, access patterns) • Leverage AWS managed services (scalable/elastic, available, reliable, secure, no/low admin) • Use log-centric design patterns (immutable log; batch, interactive, and real-time views) • Be cost-conscious (big data ≠ big cost)
  • 49. Thank you!