Watch full webinar here: https://bit.ly/3ohtRqm
Companies with corporate data lakes also need a strategy for integrating them with their overall data fabric. To take full advantage of a data lake, data architects must determine which data belongs in the lake versus other sources, how end users will find and connect to the data they need, and how best to leverage the processing power of the data lake. This webinar provides a deep-dive look at how the Denodo Platform for data virtualization enables companies to maximize the investment in their corporate data lake.
Watch this on-demand webinar to learn:
- How to create a logical data fabric with Denodo
- How to leverage a data lake for MPP Acceleration and Summary Views
- How to leverage Presto with Denodo for file-based data lakes (e.g. S3, ADLS, HDFS)
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
1. Denodo TechTalks
Product Deep-Dive Series
A product deep-dive webinar series covering the critical capabilities of Denodo's modern data virtualization platform
2. Leveraging Data Lake Capabilities: Hybrid Cloud, Summary Views and MPP Acceleration
Edwin Robbins
Senior Sales Engineer at Denodo
3. Agenda
1. Logical Data Fabric Introduction
2. Three Key Data Fabric Capabilities
• Seamless data access across platforms
• AI-based Query Acceleration Optimization
• MPP Query Acceleration
3. Demonstrations
4. Q&A
5. Logical Data Fabric is the evolution of Denodo's Data Virtualization Logical Architecture
Logical Data Fabric
"The core of the matter is being able to consolidate many diverse data sources in an efficient manner by allowing trusted data to be delivered from all relevant data sources to all relevant data consumers through one common layer."
Demystifying the Data Fabric, September 2020
7. A Logical Data Fabric
▪ Pillar 1 - Integrates data across hybrid environments
▪ Pillar 2 - Automates manual tasks using augmented intelligence
▪ Pillar 3 - Boosts performance of analytics with rapid data delivery
▪ Pillar 4 - Supports data discovery and data science initiatives
▪ Pillar 5 - Analyzes across data at rest and data in motion
▪ Pillar 6 - Catalogs all data for discovery, lineage, and associations
https://www.denodo.com/en/document/analyst-report/tdwi-checklist-report-six-critical-capabilities-logical-data-fabric - May 2020
8. 1. Source Abstraction
What's the impact of a new marketing campaign for each country?
▪ Historical sales data offloaded to a Hadoop cluster (Presto) for cheaper storage
▪ Marketing campaigns managed in an external SaaS cloud app that returns data in JSON format
▪ Country is part of the customer table stored in the Oracle DW
[Diagram: the three sources, Sales (2.8 million rows), Campaign (300 rows) and Customer (100,000 rows), are abstracted into base views, then combined, transformed and integrated (join, group and sum, join) into a virtual table (view) for consumption, with role-based security and masking, push-down optimization and caching, a data catalog, and data services on top.]
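The flow on this slide (abstract each source into a base view, then combine, transform and integrate into one virtual table) can be sketched in plain SQL over in-memory tables. Here Python's sqlite3 stands in for the three federated sources, and all table and column names are illustrative, not Denodo's actual schema:

```python
import sqlite3

# One in-memory database stands in for the three federated sources; in a
# real deployment Sales lives in Presto, Campaign in a SaaS app (JSON)
# and Customer in the Oracle DW. All data is made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (customer_id INT, campaign_id INT, amount REAL);
CREATE TABLE campaign (campaign_id INT, name TEXT);
CREATE TABLE customer (customer_id INT, country TEXT);
INSERT INTO sales VALUES (1, 10, 100.0), (2, 10, 50.0), (3, 10, 25.0);
INSERT INTO campaign VALUES (10, 'spring_promo');
INSERT INTO customer VALUES (1, 'US'), (2, 'US'), (3, 'DE');
""")

# The integrated "virtual table": join the base views, then group and sum
# to answer "what's the impact of the campaign for each country?"
rows = con.execute("""
SELECT ca.name, cu.country, SUM(s.amount) AS total
FROM sales s
JOIN campaign ca ON s.campaign_id = ca.campaign_id
JOIN customer cu ON s.customer_id = cu.customer_id
GROUP BY ca.name, cu.country
ORDER BY cu.country
""").fetchall()
print(rows)  # [('spring_promo', 'DE', 25.0), ('spring_promo', 'US', 150.0)]
```

In the virtual layer the same join-then-aggregate plan runs across the real sources, with the optimizer pushing work down to each of them.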
9. 2. Automatic Recommendations for Smart Query Acceleration
Denodo v8 uses artificial intelligence to automatically recommend selective materialization of datasets (summaries) to increase performance
• AI algorithms using active metadata
  • Usage history from the Denodo Monitor
  • Data profiling statistics of sources
  • Cost simulations to generate summary recommendations
• Optimization
  • Generic enough to cover multiple queries
  • Specific enough to keep summaries small and fast
10. 2. Query Acceleration with Aggregate Awareness Example
• TPC-DS data:
  • Distributed across 3 different systems
  • Tables with hundreds of millions of rows
• Summary: total sales by store_id, sold_date_id
| Query | Execution Time (no acceleration) | Execution Time (acceleration) | Performance Gain | Summary used |
|---|---|---|---|---|
| Total sales by year | 15.45 s | 2.38 s | 6.5x | summary_total_by_store_day |
| Total sales by quarter, store name and city | 22.49 s | 2.62 s | 8.57x | summary_total_by_store_day |
| Total sales by store and city for last quarter | 14.71 s | 0.47 s | 31.1x | summary_total_by_store_day |
| Total sales in a specific store | 14.36 s | 2.66 s | 5.39x | summary_total_by_store_day |
| Total sales in a specific store and year | 14.32 s | 3.18 s | 4.0x | summary_total_by_store_day |
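The aggregate-awareness idea behind these numbers can be shown in a few lines: a query grouped by year never touches the detail rows, because day rolls up to year and the query can be answered by re-aggregating the much smaller by-store-by-day summary. The data is made up and the rewrite logic is a simplified sketch, not Denodo's optimizer; only the summary name follows the slide:

```python
from collections import defaultdict

# Detail fact table: (store_id, (year, day), amount). Dates are
# simplified to (year, day) tuples for the sketch.
sales = [
    (1, (2020, 1), 10.0), (1, (2020, 1), 5.0),
    (1, (2020, 2), 7.0),  (2, (2020, 1), 3.0),
    (2, (2021, 5), 8.0),
]

# Materialized summary: total sales by (store_id, day) -- far fewer rows
# than the detail table, like summary_total_by_store_day on the slide.
summary_total_by_store_day = defaultdict(float)
for store, day, amount in sales:
    summary_total_by_store_day[(store, day)] += amount

def total_by_year_from_summary():
    # "Total sales by year" rewritten to re-aggregate the summary
    # instead of scanning the detail rows: day rolls up to year.
    out = defaultdict(float)
    for (_store, (year, _day)), total in summary_total_by_store_day.items():
        out[year] += total
    return dict(out)

def total_by_year_from_detail():
    # The unaccelerated plan: full scan of the detail table.
    out = defaultdict(float)
    for _store, (year, _day), amount in sales:
        out[year] += amount
    return dict(out)

# Both paths agree, but the summary path reads 4 rows instead of 5
# (at the slide's scale: thousands instead of hundreds of millions).
print(total_by_year_from_summary())  # {2020: 25.0, 2021: 8.0}
```

The same summary serves every query in the table above because each group-by (year, quarter, store, city) is a coarsening of (store, day).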
11. 3. Denodo and MPPs
• Currently, Denodo can integrate with a variety of MPP data lake engines
  • Hive, Impala, Presto, SparkSQL, Athena and Databricks/Delta Lake
• The integration with these systems covers multiple capabilities
  • As a data source
  • As the storage for Denodo's cache
  • As an additional external execution engine
    • Denodo can move data on the fly from other sources to the MPP for execution
    • Controlled by the cost-based optimizer; can also be triggered by manual hints
  • As a target for replication pipelines (remote tables)
    • Run a query in Denodo and put the results in the data lake
    • If source and target are the same (e.g. the data lake), data is processed directly instead of coming to Denodo first
    • Better support for incremental updates in data lake engines
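The "move data on the fly, controlled by the cost-based optimizer" point boils down to a trade-off: shipping rows to the MPP costs something, but per-row processing there is cheaper because it is parallel. A toy sketch of that decision follows; the cost constants and the function are invented for illustration and bear no relation to Denodo's actual cost model:

```python
def should_move_to_mpp(rows_to_transfer, rows_to_process,
                       transfer_cost_per_row=2.5,
                       local_cost_per_row=2.0,
                       mpp_cost_per_row=0.2):
    """Ship data to the MPP only when the transfer overhead is paid back
    by the MPP's cheaper (parallel) per-row processing.
    All cost constants are illustrative, not Denodo's model."""
    local_cost = rows_to_process * local_cost_per_row
    mpp_cost = (rows_to_transfer * transfer_cost_per_row
                + rows_to_process * mpp_cost_per_row)
    return mpp_cost < local_cost

# Small transfer, heavy processing: worth moving to the MPP.
print(should_move_to_mpp(2_000_000, 300_000_000))    # True
# Transfer as large as the work itself: stay local.
print(should_move_to_mpp(300_000_000, 300_000_000))  # False
```

In the real engine the row counts come from statistics and data-volume estimation, which is why accurate source statistics matter for this optimization.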
12. 3. Massive Parallel Processing: Example
[Diagram: Sales (300 million rows) is joined with Customer (2M rows) and grouped by ZIP. With MPP integration, a partial "group by customer ID" aggregation runs at the source, producing a 2M-row intermediate (sales by customer) that is transferred to the MPP, where the join and the final group-by-ZIP execute.]

| System | Execution Time | Optimization Techniques |
|---|---|---|
| Others | ~ 10 min | Basic |
| No MPP | 43 sec | Aggregation push-down |
| With MPP | 11 sec | Aggregation push-down + MPP integration (Impala, 8 nodes) |

1. Partial aggregation push-down: maximizes source processing and reduces network traffic.
2. Integrated with the cost-based optimizer: based on data volume estimates and the cost of these particular operations, the CBO can decide to move all or part of the execution tree to the MPP.
3. On-demand data transfer: for SQL-on-Hadoop systems, Denodo automatically generates and uploads Parquet files.
4. Integration with local and pre-cached data: the engine detects when data is cached or is a native table in the MPP.
5. Fast parallel execution: support for Spark, Presto and Impala for fast analytical processing on inexpensive Hadoop-based solutions.
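Point 1 above (partial aggregation push-down) is about shrinking what crosses the network: aggregating sales per customer at the source collapses the 300-million-row detail table to one row per customer before the join. A small sketch with made-up data:

```python
from collections import defaultdict

# Detail rows at the source: (customer_id, amount). On the slide this is
# the 300-million-row Sales table; a handful of rows stands in for it.
sales = [(1, 10.0), (1, 20.0), (2, 5.0), (2, 5.0), (2, 1.0), (3, 4.0)]
customer_zip = {1: "10001", 2: "10001", 3: "94107"}  # Customer table

# 1. Partial aggregation pushed down to the source: one row per customer
#    crosses the network instead of one row per sale.
per_customer = defaultdict(float)
for cust, amount in sales:
    per_customer[cust] += amount
transferred_rows = len(per_customer)  # 3 rows shipped, not len(sales) == 6

# 2. Join with Customer and finish the aggregation by ZIP in the engine
#    (or in the MPP, when the CBO moves this part of the plan there).
by_zip = defaultdict(float)
for cust, total in per_customer.items():
    by_zip[customer_zip[cust]] += total
print(dict(by_zip), transferred_rows)  # {'10001': 41.0, '94107': 4.0} 3
```

This works because SUM decomposes: summing per-customer subtotals gives the same result as summing the raw rows, so the network carries 2M rows instead of 300M at the slide's scale.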
13. 3. Future: Embedded MPP Execution Engine
▪ Customers with existing data lakes are encouraged to keep using them, as they offer equivalent capabilities
▪ However, an embedded engine has some advantages
  ▪ High-performance MPP queries over data in distributed filesystems without the need for additional software
  ▪ Out-of-the-box MPP options for caching and acceleration capabilities
  ▪ An efficient integrated store for large volumes of active metadata / query history to enable upcoming AI capabilities
  ▪ Integrated security, deployment configuration and management
14. 3. MPP Integration Timeline
1. Preliminary phase (Q2 2020)
▪ Evaluate and choose engine: PrestoSQL
2. Phase I (Q3 2020)
▪ Scripts / templates to automatically create and manage a cluster running on Kubernetes (on-prem or in the cloud with EKS/AKS)
3. Phase II (Q1 2021)
▪ Automatic metadata integration: files in the distributed file system are automatically accessible from Denodo
▪ Automatically create Presto tables and Denodo base views from the path to Parquet files
▪ To access datasets from Presto, data files in AWS S3 (or Azure Blob Storage, Azure Data Lake, etc.) are mapped to tables in the Hive Metastore
4. Phase III
▪ Tighter UI integration (explore the filesystem graphically)
5. Final Phase (Denodo 9)
▪ Deployment, management and monitoring fully integrated in Denodo's Solution Manager
This integration follows a phased approach, so that many of these capabilities can be released periodically before the next major version.
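The Phase II mapping of Parquet files to Hive Metastore tables amounts to generating DDL of the following shape. The helper below and its output are an illustrative sketch of the Presto Hive connector's CREATE TABLE syntax (`external_location`, `format` properties), not the DDL Denodo actually emits; the bucket path is hypothetical:

```python
def presto_ddl_for_parquet(catalog, schema, table, columns, location):
    """Build a Presto Hive-connector CREATE TABLE statement that maps an
    existing directory of Parquet files (e.g. on S3) to a queryable
    table. Illustrative only; Denodo generates its own equivalent DDL."""
    cols = ",\n  ".join(f"{name} {ctype}" for name, ctype in columns)
    return (
        f"CREATE TABLE {catalog}.{schema}.{table} (\n  {cols}\n)\n"
        f"WITH (\n  external_location = '{location}',\n"
        f"  format = 'PARQUET'\n)"
    )

ddl = presto_ddl_for_parquet(
    "hive", "default", "sales",
    [("store_id", "bigint"), ("amount", "double")],
    "s3://my-bucket/warehouse/sales/",  # hypothetical bucket path
)
print(ddl)
```

Once such a table exists in the Metastore, Denodo can introspect it like any other JDBC source and create the corresponding base view automatically.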
17. Next Steps
Get Started Today
Denodo Standard Free Trial
Try Denodo Standard free for 30 days
on your choice of cloud environment
denodo.com/free-trials