Watch full webinar here: https://bit.ly/33GgqE9
Data Lake strategies seem to have found their perfect companion in cloud providers. After years of criticism and struggles in the on-prem Hadoop world, data lakes are flourishing thanks to the simplification in management and low storage prices provided by SaaS vendors. For some, this is the ultimate data strategy. For others, just a repetition of the same mistakes. Attend this session to learn:
- The benefits and shortcomings of cloud data lakes
- The role and value of data virtualization in this scenario
- New developments in data virtualization for the cloud
Agenda
1. Current challenges in data management
2. Cloud Data Lakes
3. Shortcomings of data lakes
4. Data virtualization and cloud data lakes working together
5. Cloud, on prem and hybrid
6. Key takeaways
Current Challenges in Data Management
1. End users: faster & more accurate decision making – significant increase in business speed & complexity of requirements
2. Regulations: enterprise-wide governance & data security – thousands of new regulations worldwide: tax, finance, privacy, HR, environmental, etc.
3. IT: cost reduction – huge data growth with associated storage and operational costs
Data lakes were born to efficiently address the challenge of cost reduction: they allow for cheap, efficient storage of very large amounts of data.
Cloud implementations reduced the complexity of managing a large data lake.
A Bit of History – Etymology of “data lake”
https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/
Pentaho’s CTO James Dixon is credited with coining
the term "data lake". He described it in his blog in
2010:
"If you think of a data mart as a store of bottled
water – cleansed and packaged and structured
for easy consumption – the data lake is a large
body of water in a more natural state. The
contents of the data lake stream in from a
source to fill the lake, and various users of the
lake can come to examine, dive in, or take
samples."
The Data Lake – Architecture I
Distributed File System
Cheap storage for large data volumes
• Support for multiple file formats (Parquet, CSV, JSON, etc.)
• Examples:
• On-prem: HDFS
• Cloud native: AWS S3, Azure ADLS, Google GCS
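As an illustration of "multiple file formats" (record contents are hypothetical), the same records can be serialized with nothing but the Python standard library; columnar formats such as Parquet typically require a third-party library (e.g. pyarrow) and are usually preferred for analytical scans:

```python
import csv
import io
import json

# Two sample events; the same records can land in the raw zone in several formats
rows = [{"id": 1, "event": "click"}, {"id": 2, "event": "view"}]

# CSV: row-oriented text, easy to produce and ingest
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "event"])
writer.writeheader()
writer.writerows(rows)
csv_payload = buf.getvalue()

# JSON Lines: one self-describing record per line
json_payload = "\n".join(json.dumps(r) for r in rows)

print(csv_payload.splitlines()[0])   # id,event
print(json_payload.splitlines()[0])  # {"id": 1, "event": "click"}
```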
The Data Lake – Architecture II
Distributed File System
Execution Engine
Massively parallel & scalable execution engine
• Cheaper execution than traditional EDW architectures
• Decoupled from storage
• Doesn't require specialized HW
• Examples:
• SQL-on-Hadoop engines: Spark, Hive, Impala, Drill, Dremio, Presto, etc.
• Cloud native: AWS Redshift, Snowflake, AWS Athena, Delta Lake, GCP BigQuery
The Data Lake – Architecture III
Adoption of new transformation techniques
• Ingested data is normally raw and unusable by end users
• Data is transformed and moved to different "zones" with different levels of curation
• End users only access the refined zone
• Use of ELT as a cheaper transformation technique than ETL
• Use of the lake's engine and storage for data transformation instead of external ETL flows
• Removes the need for additional staging HW
[Diagram: Raw, Trusted and Refined zones layered over the distributed file system and execution engine]
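The zone promotion described above can be sketched in plain Python with JSON Lines files (zone names, fields and values are illustrative; a real lake would use an engine such as Spark and formats such as Parquet):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical zone layout inside one lake root
root = Path(tempfile.mkdtemp())
raw, trusted, refined = (root / z for z in ("raw", "trusted", "refined"))
for zone in (raw, trusted, refined):
    zone.mkdir()

# 1) Ingest as-is into the raw zone (ELT: load first, transform later)
(raw / "sales.jsonl").write_text(
    '{"amount": "10.5", "region": "emea"}\n'
    '{"amount": "bad", "region": "emea"}\n'
    '{"amount": "4.5", "region": "apac"}\n')

# 2) Promote to trusted: parse, type, and drop invalid records
records = []
for line in (raw / "sales.jsonl").read_text().splitlines():
    rec = json.loads(line)
    try:
        rec["amount"] = float(rec["amount"])
    except ValueError:
        continue  # reject/quarantine bad rows
    rec["region"] = rec["region"].upper()
    records.append(rec)
(trusted / "sales.jsonl").write_text(
    "\n".join(json.dumps(r) for r in records))

# 3) Refine: curated aggregate that end users actually query
totals = {}
for r in records:
    totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
(refined / "sales_by_region.json").write_text(json.dumps(totals))

print(totals)  # {'EMEA': 10.5, 'APAC': 4.5}
```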
Data Lake Example – AWS
• Data ingested using AWS Glue (or other ETL tools)
• Raw data stored in the S3 object store, maintaining the fidelity and structure of the data
• Metadata extracted/enriched using the Glue Data Catalog
• Business rules/DQ rules applied to S3 data as it is copied to the Trusted Zone data stores
• The Trusted Zone contains more than one data store – select the best data store for the data and its processing
• The Refined Zone contains data for consumers – curated data sets (data marts?)
• Refined Zone data stores differ – Redshift, Athena, Snowflake, …
[Diagram: internal & external data sources → ingestion with AWS Glue → Raw Zone (S3 for raw data) → Trusted Zone → Refined Zone → consumers: data portals, BI/visualization, analytic workbench, mobile apps, etc.]
Hadoop-Based Data Lakes – A Data Scientist’s Playground
The early data scientists saw Hadoop as their
personal supercomputer.
Hadoop-based Data Lakes helped democratize
access to state-of-the-art supercomputing with
off-the-shelf HW (and later cloud)
The industry push for BI made Hadoop-based solutions the standard for bringing modern analytics to any corporation
Hadoop-based Data Lakes became
“data science silos”
Can data lakes also address the other data management challenges? Can they provide fast decision making with proper governance and security?
Changing the Data Lake Goals
“The popular view is that a
data lake will be the one
destination for all the data
in their enterprise and the
optimal platform for all
their analytics.”
Nick Heudecker, Gartner
The Data Lake as the Repository of All Data
To efficiently enable self-service initiatives, a data lake must provide access to all company data. Is that realistic? And even if it is possible, it comes with multiple trade-offs:
• Huge up-front investment: creating ingestion pipelines for all company datasets into the lake is costly
• Questionable ROI, as much of that data may never be used
• Replicate the EDW? Replace it entirely?
• Large recurring maintenance costs: those pipelines must be constantly modified as data structures change in the sources
• Risk of inconsistencies: data needs frequent synchronization to avoid stale datasets
• Loss of capabilities: the lake's capabilities may differ from those of the original sources, e.g. quick access by ID in an operational RDBMS
Efficient use of the data lake to accelerate insights comes at the cost of price, time-to-market and governance
Purpose-specific data lakes
If we restrict the use of the data lake to a specific use case (e.g. data science), some of those problems go away. However, to keep fast insights and self-service capabilities, we place an additional burden on the end user:
• Higher complexity: end users need to find where data is and how to use it
• Risk of inconsistencies: data may live in multiple places, in different formats, calculated at different times
• Loss of security: frustration increases the use of shadow IT, "personal" extracts, uncontrolled data prep flows, etc.
An environment with multiple purpose-specific systems slows down time-to-market and jeopardizes security and governance
Data Lakes in the ‘Pit of Despair’
Data Lakes are 5-10 years from the Plateau of Productivity and are deep in the Trough of Disillusionment
Gartner – Logical Data Warehouse
“Adopt the Logical Data Warehouse Architecture to Meet Your Modern Analytical Needs”. Henry Cook, Gartner April 2018
DATA VIRTUALIZATION
How can a logical architecture enabled by Data Virtualization help?
Faster Time to Market for data projects
A data virtualization layer connects directly to all kinds of data sources: the EDW, application databases, SaaS applications, etc. This means that not all data needs to be replicated to the data lake for consumers to access it from a single (virtual) repository. In some cases it makes sense to replicate into the lake; in others it doesn't. DV opens that door.
• Data can be accessed immediately, improving the TTM and ROI of the lake
• If data turns out not to be useful, no time was lost building pipelines and copying data
• Data can be ingested and synchronized into the lake efficiently when needed
• Denodo can load and update data in the data lake natively, using Parquet and parallel loads
• Execution is pushed down to the original sources, taking advantage of their capabilities
• Especially significant for an EDW with strong processing capabilities
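The pushdown principle can be illustrated with the standard-library sqlite3 module standing in for any SQL source (table, columns and values are made up): the filter and aggregate execute at the source, so only a tiny result travels back to the consuming layer.

```python
import sqlite3

# Hypothetical source table standing in for an operational RDBMS
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "EMEA", 100.0), (2, "APAC", 250.0), (3, "EMEA", 75.0)])

# Naive federation: fetch every row, then filter in the virtual layer
all_rows = conn.execute("SELECT id, region, amount FROM orders").fetchall()
emea_rows = [r for r in all_rows if r[1] == "EMEA"]

# Pushdown: filter and aggregate run at the source; only one value returns
emea_total, = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE region = 'EMEA'").fetchone()

print(len(emea_rows), emea_total)  # 2 175.0
```

With millions of rows, the difference between shipping the whole table and shipping one aggregate is what makes pushdown worthwhile.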
Easier self-service through a single delivery layer
From an end-user perspective, access to all data happens through a single layer, in charge of delivering any data regardless of its actual physical location.
A single delivery layer also allows you to enforce security and governance policies.
The virtual layer becomes the "delivery zone" of the data lake, offering modeling and caching capabilities, documentation, and output in multiple formats.
• Built-in rich modeling capabilities to tailor data models to end users
• Integrated catalog, search and documentation capabilities
• Access via SQL, REST, OData and GraphQL with no additional coding
• Advanced security controls, SSO, workload management, monitoring, etc.
Accelerates query execution
Controlling data delivery separately from storage allows a virtual layer to accelerate query execution, providing faster responses than the sources alone:
• Aggregate-aware capabilities to accelerate execution of analytical queries
• Flexible caching options to materialize frequently used data: full datasets, partial results, or hybrid (cached content + updates from the source in real time)
• Powerful optimization capabilities for multi-source federated queries
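A minimal sketch of the hybrid caching option, assuming a toy in-memory source (all names are hypothetical): a materialized aggregate is combined with only the rows that arrived after the cache was built.

```python
# Hypothetical "source": (timestamp, amount) events in arrival order
SOURCE = [(1, 10.0), (2, 20.0), (3, 5.0)]

cache = {"total": None, "as_of": 0}  # materialized aggregate + watermark

def refresh_cache():
    """Full materialization: scan the whole source once."""
    cache["total"] = sum(a for _, a in SOURCE)
    cache["as_of"] = max(t for t, _ in SOURCE)

def hybrid_total():
    """Hybrid read: cached aggregate plus only rows newer than the watermark."""
    delta = sum(a for t, a in SOURCE if t > cache["as_of"])
    return cache["total"] + delta

refresh_cache()
SOURCE.append((4, 7.5))   # new data arrives after the cache was built
print(hybrid_total())     # 42.5 = 35.0 cached + 7.5 live delta
```

The expensive full scan happens only at refresh time; reads touch just the small delta, which is the trade-off hybrid caching is after.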
Denodo’s Logical Data Lake
[Diagram: the logical data lake. Sources (ETL, data warehouse, Kafka, files, physical data lake) feed the virtual layer – source abstraction, query engine, business delivery – supported by a business catalog and security & governance. The physical lake keeps its Raw, Trusted and Refined zones over the distributed file system and execution engine, with the virtual layer acting as the Delivery Zone. Consuming tools: BI & reporting, mobile, applications, predictive analytics, AI/ML, real-time dashboards.]
Denodo Customers Cloud Survey - 2019
• More than 60% of companies already have multiple projects in the cloud
• 25% are cloud-first and/or in an "advanced" state
• Only 4.5% have no short-term plans for the cloud
• More than 46% have hybrid integration needs, more than 35% are already multi-cloud
• Key Use Cases include: Analytics (49%), Data Lake (45%), Cloud Data Warehouse (40%)
• Less than 9% of on-prem systems are decommissioned (Forrester estimates 8%)
• Key Technologies in Cloud Journey: Cloud Platform Tools (56%), Data Virtualization (49.5%),
Data Lake Technology (48%)
Source: Denodo Cloud Survey 2019, N = 200.
https://www.denodo.com/en/document/whitepaper/denodo-global-cloud-survey-2019
Denodo and cloud
A virtual layer like Denodo should be deployed based on "data gravity": wherever most of your data is.
However, as we have seen, data gravity can change over time and requires hybrid models.
Denodo's deployment model allows for multiple options:
• Cloud deployments with full automation of the infrastructure management
• One-click changes in cluster settings, type of nodes, versions, etc.
• Elastic options for cluster auto-scaling
• Traditional on-prem installations
• Hybrid models with cloud and on-prem Denodo installations talking to each other
Key Takeaways
1. In most cases, not all the data is going to be in the data lake
2. Large data lake projects are complex environments that benefit from a virtual 'consumption' layer
3. Data virtualization provides the governance and management infrastructure required for a successful data lake implementation
4. Data virtualization is more than just a data access or services layer; it is a key component of a data lake
Next Steps
Access Denodo Platform in the Cloud!
Take a Test Drive today!
www.denodo.com/TestDrive
GET STARTED TODAY