Watch full webinar here: https://bit.ly/3ohtRqm
Companies with corporate data lakes also need a strategy for integrating them with their overall data fabric. To take full advantage of a data lake, data architects must determine which data belongs in the lake versus other sources, how end users will find and connect to the data they need, and how best to leverage the processing power of the data lake. This webinar provides a deep-dive look at how the Denodo Platform for data virtualization enables companies to maximize the investment in their corporate data lake.
Watch this on-demand webinar to learn:
- How to create a logical data fabric with Denodo
- How to leverage a data lake for MPP Acceleration and Summary Views
- How to leverage Presto with Denodo for file-based data lakes (e.g. S3, ADLS, HDFS)
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
1. Denodo TechTalks
Product Deep-Dive Series
A product deep-dive webinar series covering the critical capabilities of Denodo's modern data virtualization platform
2. Leveraging Data Lake Capabilities: Hybrid Cloud, Summary Views and MPP Acceleration
Edwin Robbins
Senior Sales Engineer at Denodo
3. Agenda
1. Logical Data Fabric Introduction
2. Three Key Data Fabric Capabilities
• Seamless data access across platforms
• AI-based Query Acceleration Optimization
• MPP Query Acceleration
3. Demonstrations
4. Q&A
5. Logical Data Fabric is the evolution of Denodo's Data Virtualization Logical Architecture
Logical Data Fabric
"The core of the matter is being able to consolidate many diverse data sources in an efficient manner by allowing trusted data to be delivered from all relevant data sources to all relevant data consumers through one common layer."
Demystifying the Data Fabric, September 2020
7. A Logical Data Fabric
▪ Pillar 1 - Integrates data across hybrid environments
▪ Pillar 2 - Automates manual tasks using augmented intelligence
▪ Pillar 3 - Boosts performance of analytics with rapid data delivery
▪ Pillar 4 - Supports data discovery and data science initiatives
▪ Pillar 5 - Analyzes across data at rest and data in motion
▪ Pillar 6 - Catalogs all data for discovery, lineage, and associations
https://www.denodo.com/en/document/analyst-report/tdwi-checklist-report-six-critical-capabilities-logical-data-fabric - May 2020
8. 1. Source Abstraction
What's the impact of a new marketing campaign for each country?
▪ Historical sales data offloaded to a Hadoop cluster (Presto) for cheaper storage
▪ Marketing campaigns managed in an external SaaS cloud app that returns data in JSON format
▪ Country is part of the customer table stored in the Oracle DW
[Diagram: the three sources, Sales (2.8 million rows), Campaign (300 rows) and Customer (100,000 rows), are abstracted into base views, then combined, transformed and integrated (join, group and sum, join) into a virtual table (view) for consumption, with role-based security and masking, push-down optimization and caching, a data catalog, and data services on top.]
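The flow on this slide (abstract each source into a base view, then combine, transform and integrate into one virtual table) can be sketched in plain SQL over in-memory tables. Here Python's sqlite3 stands in for the three federated sources, and all table and column names are illustrative, not Denodo's actual schema:

```python
import sqlite3

# One in-memory database stands in for the three federated sources; in a
# real deployment Sales lives in Presto, Campaign in a SaaS app (JSON)
# and Customer in the Oracle DW. All data is made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (customer_id INT, campaign_id INT, amount REAL);
CREATE TABLE campaign (campaign_id INT, name TEXT);
CREATE TABLE customer (customer_id INT, country TEXT);
INSERT INTO sales VALUES (1, 10, 100.0), (2, 10, 50.0), (3, 10, 25.0);
INSERT INTO campaign VALUES (10, 'spring_promo');
INSERT INTO customer VALUES (1, 'US'), (2, 'US'), (3, 'DE');
""")

# The integrated "virtual table": join the base views, then group and sum
# to answer "what's the impact of the campaign for each country?"
rows = con.execute("""
SELECT ca.name, cu.country, SUM(s.amount) AS total
FROM sales s
JOIN campaign ca ON s.campaign_id = ca.campaign_id
JOIN customer cu ON s.customer_id = cu.customer_id
GROUP BY ca.name, cu.country
ORDER BY cu.country
""").fetchall()
print(rows)  # [('spring_promo', 'DE', 25.0), ('spring_promo', 'US', 150.0)]
```

In the virtual layer the same join-then-aggregate plan runs across the real sources, with the optimizer pushing work down to each of them.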
9. 2. Automatic Recommendations for Smart Query Acceleration
Denodo v8 uses artificial intelligence to automatically recommend selective materialization of datasets (summaries) to increase performance
• AI algorithms using active metadata
  • Usage history from the Denodo Monitor
  • Data profiling statistics of sources
  • Cost simulations to generate summary recommendations
• Optimization
  • Generic enough to cover multiple queries
  • Specific enough to keep summaries small and fast
10. 2. Query Acceleration with Aggregate Awareness Example
• TPC-DS data:
  • Distributed across 3 different systems
  • Tables with hundreds of millions of rows
• Summary: total sales by store_id, sold_date_id
| Query | Execution Time (no acceleration) | Execution Time (acceleration) | Performance Gain | Summary used |
|---|---|---|---|---|
| Total sales by year | 15.45 s | 2.38 s | 6.5x | summary_total_by_store_day |
| Total sales by quarter, store name and city | 22.49 s | 2.62 s | 8.57x | summary_total_by_store_day |
| Total sales by store and city for last quarter | 14.71 s | 0.47 s | 31.1x | summary_total_by_store_day |
| Total sales in a specific store | 14.36 s | 2.66 s | 5.39x | summary_total_by_store_day |
| Total sales in a specific store and year | 14.32 s | 3.18 s | 4.0x | summary_total_by_store_day |
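The aggregate-awareness idea behind these numbers can be shown in a few lines: a query grouped by year never touches the detail rows, because day rolls up to year and the query can be answered by re-aggregating the much smaller by-store-by-day summary. The data is made up and the rewrite logic is a simplified sketch, not Denodo's optimizer; only the summary name follows the slide:

```python
from collections import defaultdict

# Detail fact table: (store_id, (year, day), amount). Dates are
# simplified to (year, day) tuples for the sketch.
sales = [
    (1, (2020, 1), 10.0), (1, (2020, 1), 5.0),
    (1, (2020, 2), 7.0),  (2, (2020, 1), 3.0),
    (2, (2021, 5), 8.0),
]

# Materialized summary: total sales by (store_id, day) -- far fewer rows
# than the detail table, like summary_total_by_store_day on the slide.
summary_total_by_store_day = defaultdict(float)
for store, day, amount in sales:
    summary_total_by_store_day[(store, day)] += amount

def total_by_year_from_summary():
    # "Total sales by year" rewritten to re-aggregate the summary
    # instead of scanning the detail rows: day rolls up to year.
    out = defaultdict(float)
    for (_store, (year, _day)), total in summary_total_by_store_day.items():
        out[year] += total
    return dict(out)

def total_by_year_from_detail():
    # The unaccelerated plan: full scan of the detail table.
    out = defaultdict(float)
    for _store, (year, _day), amount in sales:
        out[year] += amount
    return dict(out)

# Both paths agree, but the summary path reads 4 rows instead of 5
# (at the slide's scale: thousands instead of hundreds of millions).
print(total_by_year_from_summary())  # {2020: 25.0, 2021: 8.0}
```

The same summary serves every query in the table above because each group-by (year, quarter, store, city) is a coarsening of (store, day).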
11. 3. Denodo and MPPs
• Currently, Denodo can integrate with a variety of MPP data lake engines
  • Hive, Impala, Presto, SparkSQL, Athena and Databricks/Delta Lake
• The integration with these systems covers multiple capabilities
  • As a data source
  • As the storage for Denodo's cache
  • As an additional external execution engine
    • Denodo can move data on the fly from other sources to the MPP for execution
    • Controlled by the cost-based optimizer; can also be triggered by manual hints
  • As a target for replication pipelines (remote tables)
    • Run a query in Denodo and put the results in the data lake
    • If source and target are the same (e.g. the data lake), data is processed directly instead of coming to Denodo first
    • Better support for incremental updates in data lake engines
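The "move data on the fly, controlled by the cost-based optimizer" point boils down to a trade-off: shipping rows to the MPP costs something, but per-row processing there is cheaper because it is parallel. A toy sketch of that decision follows; the cost constants and the function are invented for illustration and bear no relation to Denodo's actual cost model:

```python
def should_move_to_mpp(rows_to_transfer, rows_to_process,
                       transfer_cost_per_row=2.5,
                       local_cost_per_row=2.0,
                       mpp_cost_per_row=0.2):
    """Ship data to the MPP only when the transfer overhead is paid back
    by the MPP's cheaper (parallel) per-row processing.
    All cost constants are illustrative, not Denodo's model."""
    local_cost = rows_to_process * local_cost_per_row
    mpp_cost = (rows_to_transfer * transfer_cost_per_row
                + rows_to_process * mpp_cost_per_row)
    return mpp_cost < local_cost

# Small transfer, heavy processing: worth moving to the MPP.
print(should_move_to_mpp(2_000_000, 300_000_000))    # True
# Transfer as large as the work itself: stay local.
print(should_move_to_mpp(300_000_000, 300_000_000))  # False
```

In the real engine the row counts come from statistics and data-volume estimation, which is why accurate source statistics matter for this optimization.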
12. 3. Massive Parallel Processing: Example
[Diagram: Sales (300 million rows) is joined with Customer (2M rows) and grouped by ZIP. With MPP integration, a partial "group by customer ID" aggregation runs at the source, producing a 2M-row intermediate (sales by customer) that is transferred to the MPP, where the join and the final group-by-ZIP execute.]

| System | Execution Time | Optimization Techniques |
|---|---|---|
| Others | ~ 10 min | Basic |
| No MPP | 43 sec | Aggregation push-down |
| With MPP | 11 sec | Aggregation push-down + MPP integration (Impala, 8 nodes) |

1. Partial aggregation push-down: maximizes source processing and reduces network traffic.
2. Integrated with the cost-based optimizer: based on data volume estimates and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP.
3. On-demand data transfer: for SQL-on-Hadoop systems, Denodo automatically generates and uploads Parquet files.
4. Integration with local and pre-cached data: the engine detects when data is cached or is a native table in the MPP.
5. Fast parallel execution: support for Spark, Presto and Impala for fast analytical processing on inexpensive Hadoop-based solutions.
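Point 1 above (partial aggregation push-down) is about shrinking what crosses the network: aggregating sales per customer at the source collapses the 300-million-row detail table to one row per customer before the join. A small sketch with made-up data:

```python
from collections import defaultdict

# Detail rows at the source: (customer_id, amount). On the slide this is
# the 300-million-row Sales table; a handful of rows stands in for it.
sales = [(1, 10.0), (1, 20.0), (2, 5.0), (2, 5.0), (2, 1.0), (3, 4.0)]
customer_zip = {1: "10001", 2: "10001", 3: "94107"}  # Customer table

# 1. Partial aggregation pushed down to the source: one row per customer
#    crosses the network instead of one row per sale.
per_customer = defaultdict(float)
for cust, amount in sales:
    per_customer[cust] += amount
transferred_rows = len(per_customer)  # 3 rows shipped, not len(sales) == 6

# 2. Join with Customer and finish the aggregation by ZIP in the engine
#    (or in the MPP, when the CBO moves this part of the plan there).
by_zip = defaultdict(float)
for cust, total in per_customer.items():
    by_zip[customer_zip[cust]] += total
print(dict(by_zip), transferred_rows)  # {'10001': 41.0, '94107': 4.0} 3
```

This works because SUM decomposes: summing per-customer subtotals gives the same result as summing the raw rows, so the network carries 2M rows instead of 300M at the slide's scale.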
13. 3. Future: Embedded MPP Execution Engine
▪ Customers with existing data lakes are encouraged to keep using them, as they offer equivalent capabilities
▪ However, an embedded engine has some advantages
  ▪ High-performance MPP queries over data in distributed filesystems without the need for additional software
  ▪ Out-of-the-box MPP options for caching and acceleration capabilities
  ▪ An efficient integrated store for large volumes of active metadata / query history to enable upcoming AI capabilities
  ▪ Integrated security, deployment configuration and management
14. 3. MPP Integration Timeline
1. Preliminary phase (Q2 2020)
▪ Evaluate and choose engine: PrestoSQL
2. Phase I (Q3 2020)
▪ Scripts / templates to automatically create and manage a cluster running on Kubernetes (on-prem or in the cloud with EKS/AKS)
3. Phase II (Q1 2021)
▪ Automatic metadata integration: files in the distributed file system are automatically accessible from Denodo
▪ Automatically create Presto tables and Denodo base views from the path to Parquet files
▪ To access datasets from Presto, data files in AWS S3 (or Azure Blob Storage, Azure Data Lake, etc.) are mapped to tables in the Hive Metastore
4. Phase III
▪ Tighter UI integration (explore the filesystem graphically)
5. Final Phase (Denodo 9)
▪ Deployment, management and monitoring fully integrated in Denodo's Solution Manager
This integration follows a phased approach, so that many of these capabilities can be released periodically before the next major version.
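The Phase II mapping of Parquet files to Hive Metastore tables amounts to generating DDL of the following shape. The helper below and its output are an illustrative sketch of the Presto Hive connector's CREATE TABLE syntax (`external_location`, `format` properties), not the DDL Denodo actually emits; the bucket path is hypothetical:

```python
def presto_ddl_for_parquet(catalog, schema, table, columns, location):
    """Build a Presto Hive-connector CREATE TABLE statement that maps an
    existing directory of Parquet files (e.g. on S3) to a queryable
    table. Illustrative only; Denodo generates its own equivalent DDL."""
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {catalog}.{schema}.{table} (\n  {cols}\n)\n"
        f"WITH (\n  external_location = '{location}',\n"
        f"  format = 'PARQUET'\n)"
    )

ddl = presto_ddl_for_parquet(
    "hive", "default", "sales",
    [("store_id", "bigint"), ("amount", "double")],
    "s3://my-bucket/warehouse/sales/",  # hypothetical bucket path
)
print(ddl)
```

Once such a table exists in the Metastore, Denodo can introspect it like any other JDBC source and create the corresponding base view automatically.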
17. Next Steps
Get Started Today
Denodo Standard Free Trial
Try Denodo Standard free for 30 days
on your choice of cloud environment
denodo.com/free-trials