Ran van den Boom, 30 November - 1 December 2021
Webinar: Modern data integration systems: the experience of Istat and other actors
Title: Data Virtualization at Statistics Netherlands
1. Data virtualization at
Statistics Netherlands
“Modern data integration systems: the
experience of Istat and other institutes”
Ran van den Boom
Program Manager Data Strategy | Statistics Netherlands
30.11-1.12.2021
2. Definition of Data Integration according to Wikipedia: “Data integration involves combining data residing in
different sources and providing users with a unified view of them”
We need to describe phenomena and increase efficiency, and we face many challenges in
implementing all these innovations. How are data integration and data virtualization going to help us?
o Why
o What
o How
o Lessons learned so far
What are the secret ingredients?
DATA VIRTUALIZATION AT STATISTICS NETHERLANDS | RAN VAN DEN BOOM
Agenda
3. • Increasing demand for our statistics
• Strengthening of phenomenon-oriented
measurement & description of society such as
Covid-19 effects on the economy, on social
division
• Increase the role of CBS as data partner of the
Government
This requires trusted data, faster time-to-market,
data-driven work, and more collaboration
between various organizations.
Why: external drivers for change
4. • External drivers lead to Innovation in existing processes and future-focused trajectories (new
data sources)
• Internal processes need to become more efficient
• Being able to access more sources, also external
• Becoming more effective: understand the data
This requires innovation of our processes:
• Introduction of silos for data at rest at interfaces
• Helps to separate process steps (data in motion) from interfaces (data at rest)
• Avoiding stove pipes, data driven instead of process driven
• More options for gathering external data sources, flexible and standardized
Why: internal drivers for change
5. Our current situation on data integration
(Source: Wikipedia)
Ideal database:
everything fits
7. In our situation: several islands, with hardly any connection
• Microdata stored in Data Service Centre – but not all
• Published data stored in StatLine – but not all
• Raw data in Data collection
• Other datasets stored anywhere
2012: Data Virtualization combines disparate data sources into a
single “virtual” data layer that provides integrated data services to
consuming applications in real-time.
Could that be a solution?
Data Integration = Sharing Data
Statistical process,
maturity levels
1st interface: raw data
2nd interface: standardized data
3rd interface: processed data
4th interface: statistics
5th interface: published data
External sources
Surveys
8. 1. Being able to share data
2. Being able to share metadata
3. Governance
Our secret ingredients:
4. Collaboration (WII4Me), Agile approach at first
Note: this is a simplified picture.
We also need storage facilities, privacy-preserving techniques and tools, tools for preparing
metadata and data for publishing, publishing capabilities for external metadata, etc.
What do we need to support the Why
1st interface: raw data
2nd interface: standardized data
3rd interface: processed data
4th interface: statistics
5th interface: published data
Sharing data
Metadata
Governance
9. Metadata
• Fit for purpose
• Patterns
• Maturity levels
• Metadata modelling
• Requirements
• Metadata Management System
What do we need to support the Why: the products
1st interface: raw data
2nd interface: standardized data
3rd interface: processed data
4th interface: statistics
5th interface: published data
Sharing data
Metadata
Governance
Data Abstraction
Layer (DAL)
Denodo
Metadata catalog
including
taxonomy
Best practices
10. • People
• Knowledge
• Processes
• Organization
• Governance
Using several approaches:
• Change management
• Agile
• Project Management
• Step by step
How do we realize that?
Data Abstraction Layer
Source A
Source B Source C
Source D
Other metadata
(MMS)
Classifications,
codelists (CLS)
Metadata catalog and search engine
Taxonomy
Solutions
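The combination of a metadata catalog, a search engine and a taxonomy can be illustrated with a minimal sketch. Everything in it is hypothetical: the `CatalogEntry` fields, the taxonomy terms and the dataset names are invented for illustration and do not reflect the actual CBS catalog or the Denodo Data Catalog.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogEntry:
    """One dataset description in a (hypothetical) metadata catalog."""
    name: str
    source: str
    description: str
    taxonomy_terms: frozenset

# Hypothetical taxonomy, stored as child term -> broader (parent) term.
TAXONOMY = {"employment": "labour", "wages": "labour", "labour": "economy"}

def broader_terms(term):
    """Return the term itself followed by all of its broader terms."""
    terms = [term]
    while term in TAXONOMY:
        term = TAXONOMY[term]
        terms.append(term)
    return terms

def search(catalog, term):
    """Return names of datasets tagged with `term` or with any narrower term."""
    return sorted(
        entry.name
        for entry in catalog
        if any(term in broader_terms(tag) for tag in entry.taxonomy_terms)
    )

# Invented example entries; the sources mirror the "Source A..D" boxes on the slide.
CATALOG = [
    CatalogEntry("jobs_2020", "Source A", "Jobs per sector", frozenset({"employment"})),
    CatalogEntry("earnings_2020", "Source B", "Earnings per job", frozenset({"wages"})),
    CatalogEntry("population_2020", "Source C", "Population counts", frozenset({"demography"})),
]
```

Searching for a broad term such as `"labour"` then also finds datasets tagged only with the narrower terms, which is one way a taxonomy helps users find data they did not know existed.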
11. Demo: a Data Abstraction Layer is not abstract at all
Advantages
• One standardized method of
accessing data (SQL)
• Dataset (“view”) is created at run time
• No copies
• Overview of all datasets
• Regardless of the location of the data or its shape/format
Disadvantages
• Source system is used for
queries
• Permissions, technical
access
MS Access to retrieve data through the DAL
Denodo Data Catalog
Designing a view
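The idea behind the demo (one SQL entry point, a view computed at run time, no copies) can be sketched in a few lines. This is not Denodo: SQLite from the Python standard library stands in for the virtualization layer, and the two attached in-memory databases, the table names and the rows are all invented for illustration.

```python
import sqlite3

# One connection plays the role of the abstraction layer.
conn = sqlite3.connect(":memory:")

# Two separate in-memory databases stand in for two source systems (hypothetical).
conn.execute("ATTACH DATABASE ':memory:' AS microdata")
conn.execute("ATTACH DATABASE ':memory:' AS published")

conn.execute("CREATE TABLE microdata.persons (person_id INTEGER, region_code TEXT)")
conn.executemany(
    "INSERT INTO microdata.persons VALUES (?, ?)",
    [(1, "R1"), (2, "R1"), (3, "R2")],
)
conn.execute("CREATE TABLE published.regions (region_code TEXT, region_name TEXT)")
conn.executemany(
    "INSERT INTO published.regions VALUES (?, ?)",
    [("R1", "Region A"), ("R2", "Region B")],
)

# The "virtual" dataset: a view spanning both sources. It stores no rows itself;
# the join is evaluated against the sources each time the view is queried.
conn.execute("""
    CREATE TEMP VIEW persons_per_region AS
    SELECT r.region_name, COUNT(*) AS persons
    FROM microdata.persons AS p
    JOIN published.regions AS r ON r.region_code = p.region_code
    GROUP BY r.region_name
""")

def persons_per_region():
    """Query the view through plain SQL, as a consuming application would."""
    return conn.execute(
        "SELECT region_name, persons FROM persons_per_region ORDER BY region_name"
    ).fetchall()
```

Because the view holds no data, a row added to `microdata.persons` shows up in the next call to `persons_per_region()`. The sketch also shows the listed disadvantage: every query on the view is executed against the source tables.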
12. Steps to achieve these ambitions
Lessons learned so far: strategy
Maturity level of data integration: Low
  Characteristics: Overview of datasets
  Purpose: Find and use data
  Requires: DAL/Denodo; create views
Maturity level of data integration: Medium
  Characteristics: Well described datasets
  Purpose: Understanding the data; explaining phenomena; automated validation; resolving inconsistencies
  Requires: Metadata MMS (COTS), CLS (own development)
Maturity level of data integration: High
  Characteristics: Standardized, harmonized metadata; metadata-driven processes
  Purpose: Data sharing; automatically generate datasets based on metadata; Information Dialog; efficiency
  Requires: Internal discussions; time (years) and resources; adapted processes, change management; tools
Here we are
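The maturity levels mention automated validation and code lists (CLS). A minimal sketch of what metadata-driven validation could look like follows. The metadata structure, the variable names and the code list values are assumptions made for this example, not the actual MMS or CLS format.

```python
# Hypothetical metadata for one dataset: per variable a type and either a
# code list or a numeric range. Invented for illustration only.
METADATA = {
    "sex": {"type": str, "codelist": {"M", "F", "X"}},
    "age": {"type": int, "min": 0, "max": 120},
}

def validate(record, metadata):
    """Check one record against the metadata and return a list of error messages."""
    errors = []
    for name, rule in metadata.items():
        if name not in record:
            errors.append(f"{name}: missing")
            continue
        value = record[name]
        if not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
            continue
        if "codelist" in rule and value not in rule["codelist"]:
            errors.append(f"{name}: {value!r} not in code list")
        if "min" in rule and value < rule["min"]:
            errors.append(f"{name}: {value} below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{name}: {value} above maximum {rule['max']}")
    return errors
```

The validation rules live in the metadata rather than in the code, so describing a new variable requires only a new metadata entry. That is the sense in which well-described datasets make automated validation possible.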
13. Challenges
Finding use cases: nobody could imagine what it
would mean (WII4Me) – at first
Lack of expertise, technical issues
Difficult to get everyone on the same page about
governance: stability vs. autonomy led to a
discussion of control vs. anarchy
Too many innovations going on at the same time
Organizing this requires more than an Agile
approach
Success factors
Support from management, one goal
One Business Owner
Find the opportunities, e.g. distribution problem for
owners of many datasets; heavy consumers of
data with no overview (e.g. national accounts,
large companies)
Create showcases
Denodo training, hire external expertise
Reduce the number of innovations: other metadata
is postponed until 2023; steering the innovations is
a subject of discussion
Introduction of Service Owners to implement Data
sharing and Metadata
Lessons learned so far
14. Thank you for your attention
Ran van den Boom
Program Manager Data Strategy | Statistics Netherlands
r.vandenboom@cbs.nl
Editor's notes
Islands of data and metadata, hardly any connection and some of them are missing
We hoped to be further along after 9 years, but it is no dolce vita