RDBMS gave us table schemas. The table schema, an essential piece of metadata, gave us the power to validate data types and enforce constraints. In the age of varied data and schema-less data stores, how can we enforce these rules, and how can we leverage metadata (even in an RDBMS) to support data validity, code checks, and automation?
This is a brief background on big data (the data lake) to put in context the importance of metadata from a governance perspective, especially on today's heterogeneous big data platforms.
2. Aim
• Outline a perspective on metadata management principles that apply in the big data space and beyond
• Provide data governance foundations that will outlast the actual technologies and serve the needs of the future
• Discuss technologies and solutions currently in the market
3. Big Data - Overview
• Big Data 5 V’s:
• Volume
• Velocity
• Variety
• Veracity
• The platform of today: a set of relatively loosely coupled components
• Data is stored on HDFS, the file system
• The catalogue of data and its schema is maintained in another service – TBC!
• Query front ends: query engines geared to different requirements
=> Value (the fifth V)
4. Platform Architecture – Modern Architecture
[Architecture diagram: data sources feed an acquisition layer into the big data platform. Inside the platform, data moves through a staging zone (ETL and data standardization), a pristine archive (compressed, e.g. gzip), a data warehouse (immutable data), and an analytics zone (allocated data changes). A schema catalogue (a well-defined reference to data structures and attributes) and a data ledger (tracking data and its access, with lineage and operations) span the platform, which serves data marts and UI/API apps.]
Source: Hortonworks
5. Why do we need to manage metadata for Big Data platforms?
• Large volumes of data landing in Hadoop/big data stores
• A growing number of users working with the data
• The need for effective control & consumption of data
The implementation needs to:
• Offer good data visibility across your cluster
• Capture data lineage across source systems and in the platform
• Audit and record operations that are performed in the platform
• Enforce policies that are defined by the platform stewards
• Help reduce data redundancy on the platform
Source: Cloudera
7. Metadata – What is it?
Data about Data!
• Business Metadata
Supplies the business context around data, such as the business term’s name, definition, owners or stewards, and associated reference data
• Technical Metadata
Provides technical information about the data, such as the name of the source table, the source table’s column names, and the data types (e.g., string, integer)
• Operational Metadata
Furnishes information about the use of the data, such as the date last updated, the number of times accessed, or the date last accessed
Source: Informatica
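To make the three categories concrete, here is a small illustrative record in Python combining business, technical, and operational metadata for a single table; all field names and values are hypothetical.

# Hypothetical metadata record for one table, illustrating the three categories.
customer_table_metadata = {
    "business": {                         # business context
        "term": "Customer",
        "definition": "A party that has purchased at least one product",
        "steward": "sales-data-office",
        "reference_data": ["customer_status_codes"],
    },
    "technical": {                        # physical / technical details
        "source_table": "crm.customers",
        "columns": {"customer_id": "bigint", "customer_name": "string"},
    },
    "operational": {                      # usage information
        "last_updated": "2016-03-01",
        "times_accessed": 1742,
        "last_accessed": "2016-03-14",
    },
}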
8. Why do I need all this metadata?
• The data lake will contain all types of data – log streams (Kafka), database imports (Sqoop)… don’t let your lake turn into a swamp!
• Consistency of definitions – to reconcile differences in terminology such as "clients" vs. "customers" and "revenue" vs. "sales"
• Clarity of data lineage – the origins of a data set, which can be granular enough to describe information at the attribute level, including the operations performed on it
• To understand data usage on your cluster
• To optimize queries and views
• Compliance and regulatory requirements:
  • Compliance – capture, store, and move data per Sarbanes-Oxley, HIPAA, Basel II
  • Security – authorization and authentication, handling sensitive data
  • Auditing – recording every attempt to access data
  • Archive & retention – data life cycle policies
Source: Teradata / TechTarget
9. Metadata System Architecture
Topologically, a metadata repository architecture follows one of three styles:
• Centralized metadata repository
Pros: efficient access, adaptability, scalability, and high performance
Cons: a single point of failure and the need for continuous synchronization
• Distributed metadata repository
Pros: real-time access to the metadata in the source repositories, up-to-date metadata
Cons: overhead in tracking source-system configuration changes and in providing high availability
• Federated or hybrid metadata repository
Central definition storage with references to the proper locations of the authoritative definitions
Source: TechTarget
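As a rough illustration of the federated/hybrid style, the Python sketch below (with entirely hypothetical class and method names) keeps definitions in a central store and delegates detail lookups to the source systems.

# Hypothetical sketch of a federated metadata catalog: the central store keeps
# the definition and a pointer; detail lookups are delegated to the source system.
class HiveMetadataSource:
    def describe(self, name):
        return {"system": "hive", "table": name}      # stand-in for a real metastore call

class CentralCatalog:
    def __init__(self, sources):
        self.sources = sources                         # e.g. {"hive": HiveMetadataSource()}
        self.definitions = {}                          # centrally stored business definitions

    def register(self, name, definition, system):
        self.definitions[name] = {"definition": definition, "system": system}

    def lookup(self, name):
        entry = self.definitions[name]
        details = self.sources[entry["system"]].describe(name)   # fetch from the source of record
        return {**entry, **details}

catalog = CentralCatalog({"hive": HiveMetadataSource()})
catalog.register("sales.customers", "A party that has purchased a product", "hive")
print(catalog.lookup("sales.customers"))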
11. Use Cases – Analytics
1. Finding the data: data scientists spend a lot of time finding the correct columns for variable selection
• Around 80% of a data scientist’s time is spent on column investigation with SMEs
2. Profile of data: reduce the time spent on data profiling through ad-hoc queries
• ~78% of the queries run on the cluster are profiling queries
3. Track the transformation: data scientists would like to understand how data sets are derived
• Not fully tracked, except at a high level
Source: Aetna
12. 1. Finding the data: Challenges
• Hive requires relatively manual traversal of the schema to find the right table and columns (sketched below)
• HDFS likewise requires traversal of the directory listing to find a file
• Any documentation external to the system becomes outdated and is not always reliable
• There is no simple way to add business metadata
Source: Aetna
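To illustrate the traversal cost, here is a sketch using PyHive (hostname and search term are placeholders): every database and table must be walked just to locate a column by name.

# Sketch of the manual schema traversal, assuming a PyHive connection (hostname is hypothetical).
from pyhive import hive

conn = hive.connect(host="hive-server.example.com", port=10000)
cur = conn.cursor()

cur.execute("SHOW DATABASES")
for (db,) in cur.fetchall():
    cur.execute(f"SHOW TABLES IN {db}")
    for (table,) in cur.fetchall():
        cur.execute(f"DESCRIBE {db}.{table}")
        for col, col_type, _comment in cur.fetchall():
            if "customer" in col.lower():              # hunting for a column by name
                print(db, table, col, col_type)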
14. 1. Finding the data: Solutions
• Run-time capture of Hive and HDFS metadata, stored in a repository
• Provide an API to query and search across the metadata (see the sketch below)
• Provide an API or other means to enrich the data with its business context
[Diagram: business metadata and technical/physical metadata from Hive, HDFS, and ingestion (Sqoop) flow into Apache Atlas.]
Source: Aetna
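A minimal sketch of such a query, assuming an Apache Atlas v2-style REST endpoint with placeholder host and credentials:

# Sketch: full-text search against Apache Atlas (v2 REST API); host and credentials are placeholders.
import requests

ATLAS = "http://atlas.example.com:21000"
resp = requests.get(
    f"{ATLAS}/api/atlas/v2/search/basic",
    params={"query": "customer", "typeName": "hive_table", "limit": 10},
    auth=("admin", "admin"),
)
for entity in resp.json().get("entities", []):
    print(entity["typeName"], entity.get("attributes", {}).get("qualifiedName"))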
15. 2. Profile of Data: Challenges and Solutions
• Access to the Hive metastore would introduce latency in production
• The Hive metastore lacks comprehensive profiling information
[Chart: average daily queries – 78% profiling, 18% exploratory, 4% production]
• Provide a system in which business and technical metadata are cross-referenced
• Provide a framework for data scientists to add their own profiling
Source: Aetna
16. 3. Track the transformation: Challenges and Solutions
• Documenting transformations is manual and difficult to scale
• Mechanisms for auditing data pipelines are still lacking
• Data quality and provenance checks are too manual
• Leverage the metadata already captured to reconstruct transformations
• Provide an API to query transformations
• Provide a visualization of the transformations
Source: Aetna
17. What do we need?
1. A searchable platform for business and technical metadata across all data types
2. A data profile store with basic metrics of the data (see the sketch after this list):
• Min
• Max
• Column distribution
3. Visual lineage for the data flow from the source systems to the different components within the platform
• ETL operations – high-level view
• Analytics queries
4. Automated, metadata-driven data ingestion and, by extension, management
• The data lake concept relies on capturing a robust set of attributes for every piece of content within the lake
• Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility
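A minimal sketch of point 2 above, computing min, max, and approximate distinct counts per column with PySpark; the table name is a placeholder.

# Sketch: basic profile metrics (min, max, approximate distinct count) per column with PySpark.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("sales.customers")                    # hypothetical Hive table

profile = df.agg(*[
    f(c).alias(f"{name}_{c}")
    for c in df.columns
    for name, f in [("min", F.min), ("max", F.max), ("distinct", F.approx_count_distinct)]
])
profile.show(truncate=False)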
19. Apache Atlas – deep dive
• Apache Atlas capabilities: overview
• Data Classification
  • Import or define a taxonomy of business-oriented annotations for data
  • Define, annotate, and automate the capture of relationships between data sets
  • Export metadata to third-party systems
• Centralized Auditing
  • Capture security access information
  • Capture operational information for executions, steps, and activities
• Search & Lineage (Browse)
  • Text-based search locates relevant data and audit events across the data lake quickly and accurately
  • Browse visualization of data set lineage lets users drill down into operational, security, and provenance-related information
• Security & Policy Engine
  • Rationalize compliance policy at runtime based on data classification schemes
Source: Hortonworks
An open-source Apache Incubator project
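As an illustration of classification-driven policy, the sketch below attaches a PII tag to an existing entity through the Atlas v2 REST API; host, credentials, and GUID are placeholders, and the PII classification type is assumed to already exist.

# Sketch: attach a "PII" classification to an existing Atlas entity so that tag-based
# policies (e.g. in Apache Ranger) can key off it; guid and host are placeholders.
import requests

ATLAS = "http://atlas.example.com:21000"
guid = "f0b4c1f2-0000-0000-0000-000000000000"          # hypothetical entity guid
resp = requests.post(
    f"{ATLAS}/api/atlas/v2/entity/guid/{guid}/classifications",
    json=[{"typeName": "PII"}],
    auth=("admin", "admin"),
)
resp.raise_for_status()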
23. Possible Solutions for other Platforms
Metacat (Netflix OSS)
• Applies metadata management at the service layer
• A federated metadata catalog for the whole data platform
• A proxy service in front of the different metadata sources
• Data metrics, data usage, ownership, categorization, retention policy, and more
• A common interface for tools to interact with metadata
Tracking Data Differences
• Applies metadata management at the service layer
• Tracks the changes to documents/entities
• Custom change tracking through logs collected in MongoDB, or via a module called MongoID
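One possible way to track document changes, sketched here with a MongoDB change stream via PyMongo rather than the custom log-based approach above; connection string and collection names are placeholders, and change streams require a replica set.

# Sketch: tracking changes to documents with a MongoDB change stream (requires a replica set);
# connection string, database, and collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo.example.com:27017/?replicaSet=rs0")
collection = client["catalog"]["entities"]

with collection.watch() as stream:                      # emits insert/update/replace/delete events
    for change in stream:
        print(change["operationType"], change.get("documentKey"))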
24. Where Else?
{ "Description": "A containerized foobar",
"Usage": "docker run --rm example/foobar [args]",
"License": "GPL",
"Version": "0.0.1-beta",
"aBoolean": true,
"aNumber" : 0.01234,
"aNestedArray": ["a", "b", "c"] } <meta name=”description” content=”155
characters of message matching text
with a call to action goes here”>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.JOSA.Meta</groupId>
<artifactId>project</artifactId> <version>1.0</version>
</project>
25. Notes - Summary
• Consider the different types of metadata you need to manage
• Build a robust descriptive dictionary for the data
• Manage metadata as a team effort: it has a lot of benefits, so make it agile but effective
Finally… remember that one’s metadata – d/dx – is someone else’s data!