Why Your Data Science Architecture Should Include a Data Virtualization Tool (New Zealand)

Chris Day
Director, APAC Sales Engineering
cday@denodo.com
CDAO New Zealand, 3-5 November 2020
Agenda
• Advanced Analytics & Machine Learning
• The Data Challenge
• Tackling the Data Preparation Tasks Problem
• Customer Story
• Q&A
"87% of data science projects never make it into production." (VentureBeat AI, July 2019)
Advanced Analytics & Machine Learning Exercises Need Data
• Improving Patient Outcomes: patient demographics, family history, patient vitals, lab test results, claims data, etc.
• Predictive Maintenance: maintenance data logs, data coming in from sensors, including temperature, running time, power level, duration, etc.
• Predicting Late Payment: company or individual demographics, payment history, customer support logs, etc.
• Preventing Fraud: the location where the claim originated, time of day, claimant history and any recent adverse events.
• Reducing Customer Churn: customer demographics, products purchased, products used, past transactions, company size, history, revenue, etc.
The Scale of the Problem
What is Data Virtualization?
1. Connect to disparate data sources
2. Combine related data into views
3. Consume in business applications
Data Virtualization Architecture Diagram
• Data consumers: enterprise applications, reporting, BI, portals, ESB, mobile, web, users, IoT/streaming data. Analytical and operational consumption through multiple protocols and formats, query/search/browse, request/reply and event-driven interaction, web services, and secure delivery.
• Data virtualization layer (Connect, Combine, Consume): normalized views of disparate data; discover, transform, prepare, improve quality and integrate; share, deliver, publish, govern and collaborate. Cross-cutting capabilities: agile development, performance, data services, resource management, data catalog, governance and metadata, security and data privacy, lifecycle management.
• Disparate data sources: databases and warehouses, cloud/SaaS applications, big data, NoSQL, web, XML, Excel, PDF, Word, etc. (from more structured to less structured), accessed via SQL, MDX, big data APIs, and web automation and indexing.
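Because the virtualization layer publishes the same views through web services as well as SQL, a consumer that cannot speak JDBC/ODBC can still pull data over HTTP. Below is a minimal sketch of that pattern in Python; the endpoint URL, query parameter, credentials and view name are hypothetical placeholders, not anything documented in these slides.

```python
# Minimal sketch: consuming a published view over a RESTful data service.
# URL, parameters, credentials and view name are hypothetical placeholders.
import requests

BASE_URL = "https://dv.example.com/rest"   # hypothetical endpoint

resp = requests.get(
    f"{BASE_URL}/views/customer_churn",    # hypothetical published view
    params={"format": "json"},             # hypothetical output option
    auth=("data_scientist", "secret"),
    timeout=30,
)
resp.raise_for_status()
rows = resp.json()
print(f"Received {len(rows)} rows")
```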
Tackling the Data Pipeline Problem
Typical Data Science Workflow
A typical workflow for a data scientist is:
1. Gather the requirements for the business problem
2. Identify useful data
   ▪ Ingest data
3. Cleanse data into a useful format
4. Analyze data
5. Prepare input for your algorithms
6. Execute data science algorithms (ML, AI, etc.)
   ▪ Iterate steps 2 to 6 until valuable insights are produced
7. Visualize and share
Source: http://sudeep.co/data-science/Understanding-the-Data-Science-Lifecycle/
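To make the loop above concrete, here is a minimal sketch of steps 2 to 7 in Python with pandas and scikit-learn. The file name, column names and choice of model are hypothetical placeholders used only to show where each step sits in code.

```python
# Minimal sketch of steps 2-7 of the workflow above.
# File, column and model choices are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# 2. Identify useful data and ingest it
raw = pd.read_csv("customer_churn.csv")

# 3. Cleanse data into a useful format
clean = raw.dropna(subset=["tenure", "revenue", "support_tickets"])
clean["churned"] = clean["churned"].astype(int)

# 4. Analyze data
print(clean.describe())

# 5. Prepare input for the algorithm
X = clean[["tenure", "revenue", "support_tickets"]]
y = clean["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 6. Execute the data science algorithm
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 7. Visualize and share (here: a simple text report)
print(classification_report(y_test, model.predict(X_test)))
```

In practice, steps 2 to 6 are iterated until the model produces valuable insights, which is exactly where the data-finding and data-preparation costs discussed next accumulate.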
Where Does Your Time Go?
• 80% of time – Finding and preparing the data
• 10% of time – Analysis
• 10% of time – Visualizing data
Source: http://sudeep.co/data-science/Understanding-the-Data-Science-Lifecycle/
Where Does Your Time Go?
A large amount of time and effort goes into tasks not intrinsically related to data science:
• Finding where the right data may be
• Getting access to the data
   ▪ Bureaucracy
   ▪ Understanding access methods and technologies (NoSQL, REST APIs, etc.)
• Transforming data into a format that is easy to work with
• Combining data originally available in different sources and formats
• Profiling and cleansing data to eliminate incomplete or inconsistent data points
Data Scientist Workflow
Identify useful data → Modify data into a useful format → Analyze data → Prepare for ML algorithm → Execute data science algorithms (ML, AI, etc.)
Identify Useful Data
If the company has a virtual layer with good coverage of its data sources, this task is greatly simplified.
▪ A data virtualization tool like Denodo can offer unified access to all data available in the company.
▪ It abstracts the technologies underneath, offering a standard SQL interface to query and manipulate the data.
To further simplify the challenge, Denodo offers a Data Catalog to search, find and explore your data assets.
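Since the virtual layer exposes standard SQL over JDBC/ODBC, an identified dataset can be pulled straight into a notebook with ordinary database tooling. A minimal sketch, assuming an ODBC DSN named "denodo" and a hypothetical view called customer_churn; neither comes from the original slides.

```python
# Minimal sketch: querying a view in the virtual layer over ODBC.
# DSN, credentials, view and column names are hypothetical placeholders.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=denodo;UID=data_scientist;PWD=secret")

# Standard SQL against the virtual layer, regardless of where the
# underlying data physically lives.
df = pd.read_sql("SELECT * FROM customer_churn WHERE country = 'NZ'", conn)
print(df.head())
```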
Ingestion and Data Manipulation Tasks
Data virtualization offers the unique opportunity of using standard SQL (joins, aggregations, transformations, etc.) to access, manipulate and analyze any data.
Cleansing and transformation steps can be easily accomplished in SQL.
Its modeling capabilities enable the definition of views that embed this logic to foster reusability.
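As an illustration of the point above, cleansing and combining can often be pushed into a single SQL statement issued against the virtual layer. The connection setup, view names and columns below are hypothetical and only show the pattern.

```python
# Minimal sketch: cleansing and combining data with standard SQL
# against the virtual layer. View and column names are hypothetical.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=denodo;UID=data_scientist;PWD=secret")

sql = """
SELECT c.customer_id,
       UPPER(TRIM(c.country))       AS country,
       SUM(COALESCE(p.amount, 0))   AS total_paid,
       COUNT(p.payment_id)          AS payment_count
FROM   customers c
LEFT JOIN payments p ON p.customer_id = c.customer_id
WHERE  c.customer_id IS NOT NULL
GROUP BY c.customer_id, UPPER(TRIM(c.country))
"""

features = pd.read_sql(sql, conn)
```

The same statement could also be saved as a derived view in the virtualization layer, so the cleansing logic is defined once and reused by every consumer.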
Tackling the Large Dataset Problem
Smart Query Acceleration for Analytics
Data virtualization enables the persistence of aggregates to accelerate the execution of analytical queries.
▪ Common joins, aggregations and filters can be precomputed (in the cache or in a data source) and used as starting points to accelerate queries.
[Diagram: a Sales fact table (300M rows) joined to Customer (1M rows) and Store (400 rows) dimensions, with precomputed summaries "Sales by Customer, Store, Date" and "Sales by Customer" used to answer questions such as "Sales by date, with store and customer information?" and "Sales with customer information?"]
Smart Query Acceleration for Analytics: Summaries
Summaries: commonly joined facts and dimensions are precomputed and used to accelerate future queries.

  System                      Execution time
  Other systems               >500 secs
  Denodo (without summary)    ~13 secs
  Denodo (with summary)       ~1.4 secs

Test scenario ("Total Sales by StoreID and Day" query, data stored in Redshift):
• Historical Sales – 220M rows
• Trailing Twelve Months Sales – 68M rows
• Date_Dim – 73K rows
• Store – contains StoreID, store_name and City
• "Summ 1" – summary, 300K rows
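For context, the query shape that benefits from such a summary is an ordinary aggregation issued through the virtual layer; the rewrite to the precomputed summary happens in the engine, not in the client code. Table and column names below are hypothetical stand-ins for the Redshift tables listed above.

```python
# Minimal sketch: "total sales by store and day" issued through the
# virtual layer. Table and column names are hypothetical; the use of a
# precomputed summary is transparent to this code.
import pyodbc
import pandas as pd

conn = pyodbc.connect("DSN=denodo;UID=data_scientist;PWD=secret")

sql = """
SELECT s.store_id,
       d.calendar_date,
       SUM(f.sales_amount) AS total_sales
FROM   historical_sales f
JOIN   store    s ON s.store_id = f.store_id
JOIN   date_dim d ON d.date_key = f.date_key
GROUP BY s.store_id, d.calendar_date
"""

totals = pd.read_sql(sql, conn)
```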
Case Study
McCormick Uses Denodo to Provide Data to Its AI Project
Background
▪ McCormick's AI and machine learning project required data that was stored in spreadsheets and in internal systems spread across 4 different continents.
▪ Portions of the data in the internal systems and spreadsheets needed to be masked when shared with McCormick's research partner firms, yet remain unmasked when shared internally.
▪ McCormick wanted to create a data service that could simplify the process of data access and data sharing across the organisation and be used by the analytics teams for their machine learning projects.
• Data Quality
• Multiple Brands
• Which Data to Use?
McCormick – Multi-purpose Data Lake
Solution Highlights
▪ Agile Data Delivery
▪ High Level of Reuse
▪ Single Discovery & Consumption Platform
Data Virtualization Benefits for McCormick
▪ Machine learning and applications were able to access refreshed, validated and indexed data in real time, without replication, from the Denodo enterprise data service.
▪ The Denodo enterprise data service gave business users the capability to compare data across multiple systems.
▪ Spreadsheets are now the exception.
▪ The quality of proposed data and services is ensured.
Data Virtualization Benefits for AI and Machine Learning Projects
✓ Denodo can play a key role in the data science ecosystem, reducing data exploration and analysis timeframes.
✓ It extends and integrates with the capabilities of notebooks, Python, R, etc. to improve the data scientist's toolset.
✓ It provides a modern "SQL-on-Anything" engine.
✓ It can leverage big data technologies like Spark (as a data source, an ingestion tool and for external processing) to work efficiently with large data volumes.
✓ It facilitates collaboration across the data community as a single platform for all data requirements.
More Information?
https://denodo.link/34diju2
Virtual Hands-on Lab
Thursday 26 November 2020
8:30am – 12:00pm
https://denodo.link/3kjtQgR
Test Drive
Access Denodo Platform in the Cloud!
Take a Test Drive today!
GET STARTED TODAY
https://denodo.link/3kjtNSd
Thanks!
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior written authorization from Denodo Technologies.