Today’s organisations require a data storage and analytics solution that offers more agility and flexibility than traditional data management systems. A data lake is an increasingly popular way to store all of your data, structured and unstructured, in one centralised repository. Because data can be stored as-is, there is no need to convert it to a predefined schema, and you no longer need to know in advance what questions you want to ask of your data.
In this webinar, you will discover how AWS gives you fast access to flexible, low-cost IT resources, so you can rapidly build and scale a data lake capable of powering any kind of analytics, such as data warehousing, clickstream analytics, fraud detection, recommendation engines, event-driven ETL, serverless computing, and Internet of Things processing, regardless of the volume, velocity, and variety of your data.
Learning Objectives:
• Discover how you can rapidly scale and build your data lake with AWS.
• Explore the key pillars behind a successful data lake implementation.
• Learn how to use the Amazon Simple Storage Service (S3) as the basis for your data lake.
• Learn about two recently launched AWS services, Amazon Athena and Amazon Redshift Spectrum, that let customers query the data lake directly.
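To make the last objective concrete, here is a minimal sketch of the kind of schema-on-read DDL you would run in Athena to lay a table over files already sitting in S3. The bucket, table, and column names are hypothetical examples; in practice you would submit the statement via the Athena console or boto3.

```python
# Sketch: build an Athena "CREATE EXTERNAL TABLE" statement over files in S3.
# Bucket, table, and column names below are hypothetical examples.

def athena_ddl(table, columns, s3_location, fmt="JSON"):
    """Return DDL that lays a schema-on-read over data already in S3."""
    serde = {
        "JSON": "org.openx.data.jsonserde.JsonSerDe",
        "CSV": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
    }[fmt]
    cols = ",\n  ".join(f"{name} {typ}" for name, typ in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"ROW FORMAT SERDE '{serde}'\n"
        f"LOCATION '{s3_location}'"
    )

ddl = athena_ddl(
    "clickstream",
    [("user_id", "string"), ("url", "string"), ("ts", "timestamp")],
    "s3://my-data-lake/raw/clickstream/",
)
print(ddl)
```

Because the table is external, dropping it never touches the underlying S3 objects, which is exactly what makes Athena a good fit for querying a data lake in place.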
Best Practices for Building Your Data Lake on AWS
1. Best Practices for Building Your Data Lake on AWS
Ian Robinson, Specialist SA, AWS
Kiran Tamana, EMEA Head of Solutions Architecture,
Datapipe
Derwin McGeary, Solutions Architect, Cloudwick
2. Amazon DynamoDB, Amazon Relational Database Service, Amazon Redshift
Diagram: Amazon Athena (a query service over Amazon S3) and Amazon EMR alongside on-premises sources: relational databases, NoSQL stores, and a data warehouse, all feeding the data lake.
Webinars: Database Freedom
9. Data Lake versus Enterprise Data Warehouse
• Complementary to the EDW (not a replacement); the data lake can be a source for the EDW
• Data lake: schema on read (no predefined schemas). EDW: schema on write (predefined schemas)
• Data lake: structured, semi-structured, and unstructured data. EDW: structured data only
• Data lake: fast ingestion of new data/content. EDW: time-consuming to introduce new content
• Data lake: data science, prediction/advanced analytics, and BI use cases. EDW: BI use cases only (no prediction/advanced analytics)
• Data lake: data at a low level of detail/granularity. EDW: data at a summary/aggregated level of detail
• Data lake: loosely defined SLAs. EDW: tight SLAs (production schedules)
• Data lake: flexibility in tools (open source/tools for advanced analytics). EDW: limited flexibility in tools (SQL only)
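The schema-on-read distinction above can be sketched in a few lines of Python (the field names are invented for illustration): heterogeneous records are stored in the lake untouched, and each use case applies its own schema only at read time.

```python
import json

# Raw records land in the lake as-is: no predefined schema, fields vary.
raw_records = [
    '{"user": "alice", "event": "click", "url": "/home"}',
    '{"user": "bob", "event": "purchase", "amount": 42.5}',
    '{"device": "sensor-7", "temp_c": 21.3}',
]

def read_with_schema(records, fields):
    """Schema on read: project only the fields this use case cares about."""
    out = []
    for line in records:
        rec = json.loads(line)
        out.append({f: rec.get(f) for f in fields})
    return out

# Two consumers, two schemas, same raw data.
clicks = read_with_schema(raw_records, ["user", "event"])
sensors = read_with_schema(raw_records, ["device", "temp_c"])
```

A schema-on-write warehouse would have rejected the sensor record at load time; here it simply surfaces as missing fields for the clickstream consumer.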
24. S3 as the Data Lake Fabric
• Unlimited number of objects and unlimited volume
• 99.99% availability
• 99.999999999% durability
• Versioning
• Tiered storage via lifecycle policies
• SSL in transit; client/server-side encryption at rest
• Low cost (just over $2,700/month for 100 TB)
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouples storage and compute
  • Run transient compute clusters (with Amazon EC2 Spot Instances)
  • Multiple, heterogeneous clusters can use the same data
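The cost and tiering bullets join up into a simple model. As a rough illustration (the per-GB rates below are assumptions for the sketch, not current AWS pricing; check the S3 pricing page), 100 TB in S3 Standard works out to a few thousand dollars a month, and a lifecycle policy that moves colder data to an infrequent-access tier cuts that further:

```python
# Rough monthly S3 storage cost sketch. The $/GB-month rates are
# illustrative assumptions, not official AWS pricing.
GB_PER_TB = 1024

def monthly_cost(tb, rate_per_gb):
    """Monthly storage cost for `tb` terabytes at a flat per-GB rate."""
    return tb * GB_PER_TB * rate_per_gb

standard = monthly_cost(100, 0.023)  # all 100 TB in S3 Standard
# 60 TB aged out to an infrequent-access tier by a lifecycle policy:
tiered = monthly_cost(40, 0.023) + monthly_cost(60, 0.0125)

print(f"Standard only: ${standard:,.0f}/month")
print(f"With lifecycle tiering: ${tiered:,.0f}/month")
```
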
52. AWS DATA LAKES
Kiran Tamana, EMEA Head of Solutions Architecture, Datapipe
DATAPIPE.COM | US: 877 773 3306 | UK: +44 800 634 3414 | HK: +852 3521 0215
53. PREMIER CONSULTING PARTNER IN EMEA AND NORTH AMERICA
RECOGNISED AS A LEADER BY GARTNER WORLDWIDE
• 170+ years of combined AWS Professional Services experience
• 2.2k+ days of experience managing AWS, since 2010
• 10x winner of the Stevie Award for Customer Service
• 9+ million monthly instance hours managed
• 4 continents of global presence and support, with Datapipe data center locations in 29 datacenters across 9 countries
• Audited AWS Premier Consulting Partner and AWS Direct Connect Partner
• Proprietary keyless management
54. DATA LAKES – A MARITIME USE CASE
A requirement to manage a global network of multiple data types, coming from a wide variety of business-critical maritime sources:
• Vessel tracking via satellite and terrestrial ID data
• Vessel ownership & characteristic data
• Vessel registration, safety and insurance data
• Casualty, arrest and vessel inspection data
• Fleet development data
• Global Ports and Terminal data
Scale: 120,000 tracked vessels | 163,500 shipping companies | 2,800 ports | 19,000 terminals
55. DEVELOP A MIGRATION STRATEGY
• Review ongoing technical deployments
• Identify existing pain points
• Conduct a situational analysis of the current system
• Evaluate infrastructure for automation
• Assess configuration usage
• Recommend details on application deployment management
57. Best Practices for Building Your Data Lake on AWS
Presented by
Derwin McGeary
Solutions Architect, Cloudwick EMEA
Contact us: services@cloudwick.com
58. Webinar Agenda
➢ About Cloudwick
➢ Reference Data Lake Architecture on AWS
➢ AWS Data Lake Implementations
➢ How to Get Started
Contact us: services@cloudwick.com
59. About Cloudwick
• 7 years: founded in 2010
• 3 global offices: US | EMEA | APAC
• 150+ professionals: US | EMEA | APAC
• Global 1000 proven success
• 400+ certifications: big data, analytics, cybersecurity, visualization & cloud partnerships
• 100+ data lake engagements
Contact us: services@cloudwick.com
60. Reference Data Lake Architecture on AWS
Diagram: marketing apps & databases, CRM databases, back-office processing data sources, other semi-structured data sources, and any other data sources feed the data lake, which is organised into a raw layer and a curated layer. Transformations/ETL between the two run on Glue (using PySpark) or EMR.
Data Ingestion Layer: Based on the data velocity, volume, and veracity, pick the appropriate ingest services, such as Kafka, Flume, Sqoop, DMS, or Kinesis.
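If Kinesis is the chosen ingest service, one practical detail is that its PutRecords API accepts at most 500 records per call, so producers batch their events. A minimal batching sketch in pure Python (the stream name is hypothetical, and the boto3 call is only indicated in a comment):

```python
# Chunk an event stream into PutRecords-sized batches (max 500 per call).
def batches(records, size=500):
    for i in range(0, len(records), size):
        yield records[i:i + size]

events = [{"Data": f"event-{n}", "PartitionKey": str(n % 8)} for n in range(1200)]

calls = list(batches(events))
# For each batch you would then call, e.g.:
#   boto3.client("kinesis").put_records(StreamName="ingest", Records=batch)
```
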
Data Store Layer: This is the RAW layer of the data lake. Objects in the RAW layer are immutable, and the layer is typically partitioned by date/time.
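Partitioning the RAW layer by date/time typically shows up as Hive-style key prefixes in S3, which downstream engines can prune at query time. A small sketch (the bucket prefix, source, and file names are hypothetical):

```python
from datetime import datetime, timezone

def raw_key(source, filename, ts):
    """Immutable RAW-layer object key, partitioned Hive-style by date."""
    return (f"raw/source={source}/"
            f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/{filename}")

ts = datetime(2017, 6, 1, tzinfo=timezone.utc)
key = raw_key("crm", "export-0001.json", ts)
# -> 'raw/source=crm/year=2017/month=06/day=01/export-0001.json'
```

Because objects are written once under a dated prefix and never rewritten, reprocessing a bad day means re-reading one partition rather than the whole lake.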
Curated Layer: This is the curated layer of the data lake, where transformations/ETL make the data useful and ready for end users.
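The raw-to-curated ETL step in miniature, with pure Python standing in for the Glue/PySpark job and invented field names: immutable raw JSON is cleaned and typed into rows that are directly useful to end users, with unparseable rows routed aside rather than dropped silently.

```python
import json

raw = [
    '{"user": " Alice ", "amount": "42.50", "currency": "GBP"}',
    '{"user": "bob", "amount": "bad-value", "currency": "GBP"}',
]

def curate(lines):
    """Clean and type raw records; route unparseable rows to a reject list."""
    good, rejects = [], []
    for line in lines:
        rec = json.loads(line)
        try:
            good.append({
                "user": rec["user"].strip().lower(),
                "amount": float(rec["amount"]),
                "currency": rec["currency"],
            })
        except (KeyError, ValueError):
            rejects.append(rec)
    return good, rejects

curated, rejected = curate(raw)
```
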
Analytics Layer: The data is modelled in the DWH (data warehouse) and exposed through JDBC/ODBC connectors to BI tools, allowing for more reuse and sharing.
Access Layer: This layer allows for innovative reuse, access, visualisation, and reporting on the data. S3 storage provides an access-controlled zone with secure and controlled access for internal business units and 3rd-party companies, exposed through API Gateway and JDBC/ODBC to native and 3rd-party BI and visualisation tools.
Contact us: services@cloudwick.com
61. AWS Data Lake Implementations
Contact us: services@cloudwick.com
62. SGN Use Case: Background & Challenges
➢ One of the Largest Gas Distribution Companies
➢ Reporting Responsibilities
➢ Data Silos
➢ Challenges:
  • Growing Data Sources
  • Existing ETL process
  • Too many silos
  • Data Provision for different use cases
  • Operational and Analytical Reporting
Contact us: services@cloudwick.com
63. Use Case: Data Lake and Migration for SGN
Contact us: services@cloudwick.com
64. Where to start?
Data Inventory - what data/metadata do you already have?
Gap Analysis - what business and technical metadata do you need for your use cases?
Value - knowing the value of data and getting value from data
Cloudwick can help you get started:
• Cloudwick Quickstart: http://tinyurl.com/CloudwickQuickStart
• Cloudwick Jumpstart: http://tinyurl.com/CloudwickJumpStart
Contact us: services@cloudwick.com