The document describes an architecture for semantically integrating enterprise data lakes. It proposes a knowledge graph that links metadata, data models and key performance indicators to provide a common meaning for data. Raw data is stored in a data lake and ingested from various sources. A metadata layer captures dataset metadata, ontologies and integration rules to link disparate data. An interface allows users to access consolidated views generated by executing queries on Hadoop. The process involves cataloging, discovering, lifting, linking and validating datasets to integrate them based on rules into the knowledge graph.
2. MOTIVATION
Enterprise Data Management Objective:
“Ensure all data is aligned to a common meaning
in order to achieve automation in performing
complex analytics and generating trusted
reports.”
Source:
2015 Data Management Industry Benchmark -
EDM Council
September 26, 2016
In 2015, only 7% of respondents claimed to already be using shared and unambiguous definitions of data across the firm and to have them accessible as operational metadata.
3. ARCHITECTURE
[Architecture diagram: inbound data sources (Management Accounting, Risk Management, Regulatory Reporting, Treasury, Marketing, Accounting) feed an Inbound Raw Data Store on a Big Data DWH infrastructure. A Knowledge Graph for metadata, KPI definitions and data models forms the Corporate Memory. Frontends give access to relationship and KPI definitions / documentation and to (ad hoc) reports; outbound data is delivered to target systems.]
4. ARCHITECTURE
Data Ingestion
• Files in the data lake (CSV, XML, Excel)
• (Relational) databases
5. ARCHITECTURE
Data Lake
• Emerging approach to handling large amounts of data
• Cost-effective storage
• Data is held in its native format
Good
Does not force an up-front integration of the ingested datasets
Bad
Retaining an overview of the disparate data silos in the lake without a coherent shared view is challenging
6. ARCHITECTURE
Data Warehouses
• Existing infrastructure
• Typically relational databases
7. ARCHITECTURE
Metadata Layer
• Dataset Metadata
• Ontologies
• Integration Rules
8. ARCHITECTURE
Graphical User Interface
Customer Applications
9. INTEGRATION PROCESS
Dataset Management
• Catalog Datasets
• Catalog Ontologies
• Manage Metadata
Dataset Discovery
• Data Profiling
• Dataset Exploration
Dataset Integration
• Dataset Lifting
• Dataset Linking
• Data Quality Validation
Data Access
• Domain-Specific Consolidated Views
• Execution on Hadoop
10. DATASET MANAGEMENT
11. DATASET CATALOG
• Enables the user to explore and manage datasets in the data lake
• Files in the data lake (CSV, XML, Excel)
• Databases (Apache Hive or external databases)
12. MANAGING METADATA
• Exploring and editing dataset metadata
• Semantic content information, like textual descriptions, tags and related persons
• Technical information and parameters, like formats, data model and encoding
• Access information, like access path or URL, source system or API call
• Organizational provenance, like organizational units owning or maintaining the dataset
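The four metadata facets above could be captured in a simple record structure. The following is an illustrative sketch, not the tool's actual schema; all field names are assumptions:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a dataset metadata record covering the four
# facets listed above; field names are illustrative, not the tool's API.
@dataclass
class DatasetMetadata:
    # Semantic content information
    description: str = ""
    tags: list = field(default_factory=list)
    related_persons: list = field(default_factory=list)
    # Technical information and parameters
    data_format: str = ""          # e.g. "CSV", "XML", "Excel"
    encoding: str = "utf-8"
    # Access information
    access_path: str = ""          # file path, URL, or API call
    source_system: str = ""
    # Organizational provenance
    owning_unit: str = ""

bonds = DatasetMetadata(
    description="Corporate bond master data",
    tags=["bonds", "fixed-income"],
    data_format="CSV",
    access_path="/lake/inbound/bonds.csv",
    owning_unit="Treasury",
)
```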
13. DATASET DISCOVERY
14. DATASET DISCOVERY
• Goal: Augment a dataset with data from related datasets
• Automatic discovery of datasets with overlapping information
• Explorative interface
• Discovery is based on two data parts:
• Business metadata
• Profiling summary
15. DISCOVERY VIEW
• Datasets are matched based on their metadata (profiling + business data)
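As a minimal illustration of metadata-based matching, a discovery view could rank candidate datasets by tag overlap. This is a hypothetical signal; the actual matching combines the profiling summary with business metadata:

```python
def tag_overlap(meta_a, meta_b):
    """Jaccard similarity of the datasets' tag sets -- one simple
    signal a discovery view could rank candidate datasets by."""
    a, b = set(meta_a["tags"]), set(meta_b["tags"])
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

bonds = {"tags": ["bonds", "fixed-income", "treasury"]}
ratings = {"tags": ["ratings", "fixed-income", "treasury"]}
score = tag_overlap(bonds, ratings)  # 2 shared of 4 total -> 0.5
```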
16. DATASET PROFILING
• Datasets often contain implicit and explicit schema information
• Column names, data formats, enumerated values etc.
• Example: column contains formatted dates
• Idea: Extract a dataset summary
• For each column / property the summary contains:
1. Data type (e.g., number, date, industry classification)
2. Data format (e.g., date format)
3. Data statistics (e.g., range, distribution, most frequent values)
• Materialized as RDF with UI view
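A per-column summary of the kind described above can be sketched as follows. This is illustrative only; the type and format detection here are deliberately simplistic:

```python
import re
from collections import Counter

DATE_RE = re.compile(r"\d{2}-\d{2}-\d{4}$")  # DD-MM-YYYY

def profile_column(values):
    """Build a per-column summary: data type, format, basic statistics."""
    summary = {}
    if all(DATE_RE.match(v) for v in values):
        summary["type"], summary["format"] = "date", "DD-MM-YYYY"
    elif all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in values):
        summary["type"] = "number"
        nums = [float(v) for v in values]
        summary["range"] = (min(nums), max(nums))
    else:
        summary["type"] = "string"
    freq = Counter(values)
    summary["most_frequent"] = freq.most_common(3)
    summary["selectivity"] = len(freq) / len(values)  # distinct-value ratio
    return summary

p = profile_column(["12-01-2016", "26-09-2016", "12-01-2016"])
# p["type"] == "date", p["format"] == "DD-MM-YYYY"
```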
17. DETECTING DATA TYPES
• Detecting common datatypes as well as user-defined types
• Common datatypes
• Numbers
• Dates / Times
• Geographic locations (geo-coordinates, states, countries)
• User-defined data types can be integrated by adding an ontology / taxonomy
• Usually a SKOS taxonomy
• Managed as another dataset in the dataset management
• Example: Industry taxonomy
• Standard taxonomy (NACE, SIC, NAICS) or company specific
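Detection of a user-defined type against such a taxonomy could work roughly like this sketch, which stands in a tiny label map for a real SKOS taxonomy; all names and the threshold are assumptions:

```python
# Illustrative only: a tiny stand-in for a SKOS industry taxonomy,
# here just a mapping from preferred labels to concept identifiers.
INDUSTRY_TAXONOMY = {
    "banking": "ind:Banking",
    "utilities": "ind:Utilities",
    "electrical equipment": "ind:ElectricalEquipment",
}

def detect_taxonomy_type(values, taxonomy, threshold=0.8):
    """A column is assigned the user-defined type if most of its
    values resolve to concepts in the taxonomy."""
    hits = sum(1 for v in values if v.strip().lower() in taxonomy)
    return hits / len(values) >= threshold

cols = ["Banking", "Utilities", "Banking", "Utilities"]
is_industry = detect_taxonomy_type(cols, INDUSTRY_TAXONOMY)  # True
```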
18. FORMATS AND STATISTICS
• For some types, the data format is detected
• Example: Dates are formatted in DD-MM-YYYY
• Two functions are generated:
1. Parser that is able to read the detected representation
2. Normalizer that converts the parsed values into a configurable, organization-wide target representation
• Statistics summarize the values:
• Value range and distribution
• Most frequent values
• Data selectivity
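The parser/normalizer pair could be generated as in this sketch, which assumes the detected format is expressed in Python's strptime notation:

```python
from datetime import datetime

def make_parser_normalizer(detected_format, target_format="%Y-%m-%d"):
    """Generate the two functions described above for a detected date
    format; format strings use Python's strptime syntax here."""
    def parse(value):
        return datetime.strptime(value, detected_format)
    def normalize(value):
        return parse(value).strftime(target_format)
    return parse, normalize

# DD-MM-YYYY corresponds to "%d-%m-%Y" in strptime notation
parse, normalize = make_parser_normalizer("%d-%m-%Y")
normalize("26-09-2016")  # -> "2016-09-26"
```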
19. DISCOVERY VIEW
• Datasets are matched based on their metadata (profiling + business data)
20. INTEGRATION PROCESS
21. DATA INTEGRATION
• The integration process is driven by a set of rules
• Lifting rules map the source datasets to an ontology
• Linking rules connect the different datasets into a knowledge graph
• Rules are operator trees, consisting of four types of operators:
• Data Access Operators
• Transformation Operators
• Similarity Operators
• Aggregation Operators
• Rules can be learned using genetic programming algorithms
• Rules are human-understandable and can be edited
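A rule built from the four operator types might be encoded as a tree of composable functions. This is a hypothetical minimal encoding, not the system's actual rule language:

```python
# A minimal, hypothetical encoding of a rule as an operator tree
# using the four operator types named above.
def access(path):                      # Data Access Operator
    return lambda e: e.get(path, "")

def lowercase(op):                     # Transformation Operator
    return lambda e: op(e).lower()

def equality(op_a, op_b):              # Similarity Operator
    return lambda a, b: 1.0 if op_a(a) == op_b(b) else 0.0

def average(*sims):                    # Aggregation Operator
    return lambda a, b: sum(s(a, b) for s in sims) / len(sims)

rule = average(
    equality(lowercase(access("isin")), lowercase(access("isin"))),
    equality(access("country"), access("country")),
)
bond = {"isin": "CA639832AA25", "country": "Canada"}
rating = {"isin": "ca639832aa25", "country": "Canada"}
score = rule(bond, rating)  # 1.0: both comparisons match
```

Because the rule is an explicit tree of named operators, it stays human-readable and editable, which is the property the slide emphasizes.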
22. DATASET LIFTING
• Objective: Map the datasets in the data lake to a consistent vocabulary
• A lifting rule consists of a number of mappings
• Each mapping assigns a term in the original dataset (such as a column for tabular data) to a term in the target ontology (such as a property provided by an ontology)
• Multiple mappings can be managed for each dataset to allow different views on the same data
• Initial mappings are generated automatically from the profiling results; the user can then build on them
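A lifting rule, viewed as a set of column-to-property mappings, can be sketched like this; the fibo:* property names follow the lifting example, everything else is illustrative:

```python
# Sketch of a lifting rule: a set of mappings from source columns to
# ontology properties. The fibo:* IRIs are shortened for readability.
LIFTING_RULE = {
    "ISIN": "fibo:hasSecurityIdentifier",
    "Country": "fibo:legallyRecordedIn",
    "Industry": "fibo:industrySector",
}

def lift(row, rule, subject_column="Bond"):
    """Apply a lifting rule to one tabular row, yielding RDF-style
    (subject, predicate, object) triples."""
    subject = row[subject_column]
    return [(subject, rule[col], row[col]) for col in rule if col in row]

row = {"Bond": "NEDWBK CAD 5,2%25", "ISIN": "CA639832AA25",
       "Country": "Canada", "Industry": "Banking"}
triples = lift(row, LIFTING_RULE)
# e.g. ("NEDWBK CAD 5,2%25", "fibo:hasSecurityIdentifier", "CA639832AA25")
```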
23. LIFTING EXAMPLE

Bond                                         ISIN           Country   Industry
NEDWBK CAD 5,2%25                            CA639832AA25   Canada    Banking
SIEMENSF1.50%03/20                           DE000A1G85B4   Germany   Electrical Equipment
Electricite de France (EDF), 6,5% 26jan2019  USF2893TAB29   France    Utilities

[Lifted graph: the bond "NEDWBK CAD 5,2%25" is connected via fibo:hasSecurityIdentifier to "CA639832AA25", via fibo:legallyRecordedIn to a country ontology (France, Germany, grouped under EMEA), and via fibo:industrySector to an industry ontology (Banking, Utilities).]
24. LINKING
• Goal: Connect individual datasets to a knowledge graph
• Identify related entities in different datasets and link them
• Either entities describing the same real-world object, or some other relation between them
[Linking example: the bond "NEDWBK CAD 5,2%25" (with its fibo:industrySector and fibo:legallyRecordedIn links into the industry and country ontologies) is connected via hasRating to the entity "Rating CAD 5,2%25", which carries the ratingScore "AAA".]
25. LINKAGE RULES
• Linking is based on domain-specific rules
• Specify the conditions that must hold true for two entities to be linked
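As a concrete (hypothetical) instance of such a condition, a minimal rule could require that two entities carry the same normalized security identifier:

```python
def normalized(value):
    return value.replace(" ", "").lower()

def isin_match(a, b):
    # Condition: the two entities carry the same (normalized) ISIN
    return 1.0 if normalized(a["isin"]) == normalized(b["isin"]) else 0.0

def link(entities_a, entities_b, rule, threshold=0.9):
    """Generate links between entities of two datasets whenever the
    rule's similarity score clears the threshold."""
    return [(a["id"], b["id"])
            for a in entities_a for b in entities_b
            if rule(a, b) >= threshold]

bonds = [{"id": "bond:1", "isin": "CA639832AA25"}]
ratings = [{"id": "rating:7", "isin": "ca639832aa25"},
           {"id": "rating:8", "isin": "DE000A1G85B4"}]
links = link(bonds, ratings, isin_match)  # -> [("bond:1", "rating:7")]
```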
26. LEARNING LINKAGE RULES
Problem: Manually writing rules is time-consuming and requires expertise
Approach: Interactive machine learning algorithm for generating rules
• Generates a rule based on a number of user-confirmed link candidates
• Link candidates are actively selected by the learning algorithm to favor those that yield a high information gain
• The user does not need any knowledge of the characteristics of the dataset or of any particular similarity computation techniques
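The interactive loop might look roughly like this sketch, which substitutes a single learned threshold for the genetic-programming rule learner; every name here is illustrative:

```python
def generate_rule(labeled):
    """Fit the simplest rule family -- a threshold on one similarity
    score -- from user-labeled link candidates. A stand-in for the
    genetic-programming learner mentioned earlier."""
    pos = [c["score"] for c, is_link in labeled if is_link]
    neg = [c["score"] for c, is_link in labeled if not is_link]
    cut = (min(pos) + max(neg)) / 2 if pos and neg else 0.5
    return lambda c: c["score"] >= cut

def most_informative(candidates, k=2):
    """Active selection: ask the user about candidates closest to the
    decision boundary, where a label yields the most information."""
    return sorted(candidates, key=lambda c: abs(c["score"] - 0.5))[:k]

# Simulated session: the "user" (oracle) confirms a candidate pair
# whenever its true similarity is at least 0.5.
candidates = [{"score": s} for s in (0.1, 0.4, 0.6, 0.9)]
labeled = [(c, c["score"] >= 0.5) for c in most_informative(candidates)]
rule = generate_rule(labeled)  # threshold learned between 0.4 and 0.6
```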
27. INTEGRATION PROCESS
28. VIEW GENERATION
• The user selects a set of lifted and linked datasets
29. DATA ACCESS
• Generate data flows based on Apache Spark
• The data flows utilize Resilient Distributed Datasets (RDDs)
• RDDs derive new datasets from existing datasets by applying a chain of transformations
• A derived dataset can either be recomputed on the fly or persisted on stable storage
• Data flows can be executed efficiently on Hadoop clusters
[Data flow diagram: Corporate Bonds, Internal Ratings and External Ratings from the Hadoop data lake each pass through a Data Lifting step (Apache Spark RDD), are combined by a Data Linking step (Apache Spark RDD) in eccenca Corporate Memory, and are delivered to data consumers via SQL, CSV, Excel, or the Spark API.]
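Since the Spark code itself is not shown, here is a plain-Python analogue of the lifting-and-linking data flow above, using lazy generators to mimic how RDDs derive datasets through transformation chains; no Spark APIs are used, and all names are illustrative:

```python
# Plain-Python analogue of the Spark data flow sketched above: each
# stage derives a new (lazy) dataset from its inputs, and nothing is
# computed until the result is consumed -- the same idea RDDs
# implement at cluster scale.
def lift_bonds(rows):
    for row in rows:                      # "Data Lifting 1"
        yield {"isin": row["ISIN"], "industry": row["Industry"]}

def lift_ratings(rows):
    for row in rows:                      # "Data Lifting 2"
        yield {"isin": row["isin"], "rating": row["rating"]}

def link_by_isin(bonds, ratings):         # "Data Linking"
    by_isin = {r["isin"]: r["rating"] for r in ratings}
    for b in bonds:
        if b["isin"] in by_isin:
            yield {**b, "rating": by_isin[b["isin"]]}

bonds_raw = [{"ISIN": "CA639832AA25", "Industry": "Banking"}]
ratings_raw = [{"isin": "CA639832AA25", "rating": "AAA"}]
result = list(link_by_isin(lift_bonds(bonds_raw), lift_ratings(ratings_raw)))
# one linked record combining bond and rating attributes
```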