Metadata Management in Big Data
Data Management Challenges
@ezzibdeh
Tariq Ezzibdeh
Aim
• Outline some perspectives on metadata management principles that apply in the big data space and beyond
• Provide some data governance foundations that should outlast the current technologies and serve the needs of the future
• Discuss technologies and solutions currently on the market
Big Data - Overview
• Big Data 5 V’s:
• Volume
• Velocity
• Variety
• Veracity
=> Value
• Platform of today: a set of relatively decoupled components
• Data is stored on HDFS (the file system)
• The catalogue of data and its schema is maintained in another service – TBC!
• Query front ends: query engines chosen for different requirements
Platform Architecture – Modern Architecture
Diagram labels:
• Data Sources, Acquisition, Data Systems
• Staging Zone – ETL and data standardization
• Pristine Archive – compressed (Gzip etc.)
• Data Warehouse – immutable data
• Analytics Zone – allocated data changes
• Schema Catalogue – well-defined reference to data structures and attributes
• Data Ledger – track data and its access with lineage and operations
• Big Data Platform, Data Marts, UI/API, Apps
Source: Hortonworks
Why do we need to manage metadata for Big Data platforms?
• Large volumes of data landing in Hadoop/Big Data
• A growing number of users working with the data
• The need for effective control & consumption of data
The implementation needs to:
• Offer good data visibility across your cluster
• Capture data lineage across source systems and in the platform
• Audit and record operations that are performed in the platform
• Enforce policies that are defined by the platform stewards
• Help reduce data redundancy on the platform
Source: Cloudera
Metadata in Action
Metadata – What is it?
Data about Data!
• Business Metadata
– Supplies the business context around data, such as the business term’s name, definition, owners or stewards, and associated reference data
• Technical Metadata
– Provides technical information about the data, such as the name of the source table, the source table column name, and the data type (e.g., string, integer)
• Operational Metadata
– Furnishes information about the use of the data, such as date last updated, number of times accessed, or date last accessed
Source: Informatica
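The three metadata types above can be pictured as three small records attached to the same column. The following is a minimal sketch only: the class names, field names and example values are hypothetical and do not come from any particular metadata tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class BusinessMetadata:
    term: str                       # the business term's name
    definition: str
    stewards: List[str] = field(default_factory=list)


@dataclass
class TechnicalMetadata:
    source_table: str
    source_column: str
    data_type: str                  # e.g. "string", "integer"


@dataclass
class OperationalMetadata:
    last_updated: Optional[date] = None
    last_accessed: Optional[date] = None
    access_count: int = 0


@dataclass
class ColumnMetadata:
    """One column described from the business, technical and operational side."""
    business: BusinessMetadata
    technical: TechnicalMetadata
    operational: OperationalMetadata = field(default_factory=OperationalMetadata)

    def record_access(self, on: date) -> None:
        # Operational metadata changes every time the data is used.
        self.operational.access_count += 1
        self.operational.last_accessed = on


# Hypothetical example values, for illustration only.
customer_id = ColumnMetadata(
    business=BusinessMetadata(
        term="Customer",
        definition="A party that holds an active account",
        stewards=["data-steward@example.com"],
    ),
    technical=TechnicalMetadata("crm.customers", "cust_id", "integer"),
)
customer_id.record_access(date(2016, 1, 15))
```

Keeping the three parts separate reflects that they usually have different owners: stewards maintain the business part, the source system dictates the technical part, and the platform updates the operational part.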
Why do I need all this metadata?
• Data lake will contain all types of data: log streams (Kafka), databases (Sqoop)… don’t let your lake turn into a swamp!
• Consistency of definitions – to reconcile differences in terminology such as "clients" and "customers", "revenue" and "sales"
• Clarity of data lineage – the origins of a data set, granular enough to define information at the attribute level, including the operations on it
• To understand data usage on your cluster
• Optimize queries and views
• Compliance and Regulatory
• Compliance – capture, store and move data: Sarbanes-Oxley, HIPAA, Basel II
• Security – authorization, authentication, handling sensitive data
• Auditing – recording every attempt to access
• Archive & Retention – data life cycle policies
Source: Teradata/TechTarget
Metadata System Architecture
Topologically, a metadata repository architecture follows one of three styles:
• Centralized Metadata repository
– Pros: efficient access and adaptability, scalability and high performance
– Cons: single point of failure and continuous synchronization
• Distributed Metadata repository
– Pros: real-time access to the metadata repo, up-to-date metadata
– Cons: overhead in keeping the configuration in step with source system changes, and in HA
• Federated or Hybrid Metadata repository
– Central definition storage with references to the proper locations of the accurate definitions
Source: Techtarget
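To make the federated style concrete, here is a minimal sketch under stated assumptions: the central catalog stores only a reference (which source owns which asset) and fetches the definition from the owning source at lookup time. `FederatedCatalog`, the resolver functions and the two in-memory "sources" are hypothetical stand-ins, not the API of any real product.

```python
from typing import Callable, Dict, Optional

# A resolver fetches the authoritative definition from one source system.
Resolver = Callable[[str], Optional[dict]]


class FederatedCatalog:
    """Central index that stores references; definitions stay at the source."""

    def __init__(self) -> None:
        self._resolvers: Dict[str, Resolver] = {}   # source name -> resolver
        self._locations: Dict[str, str] = {}        # asset name -> source name

    def register_source(self, source: str, resolver: Resolver) -> None:
        self._resolvers[source] = resolver

    def register_asset(self, asset: str, source: str) -> None:
        if source not in self._resolvers:
            raise KeyError(f"unknown source: {source}")
        self._locations[asset] = source

    def lookup(self, asset: str) -> Optional[dict]:
        source = self._locations.get(asset)
        if source is None:
            return None
        # Delegate to the owning source, so the answer is always current.
        return self._resolvers[source](asset)


# Hypothetical in-memory stand-ins for two source systems.
hive_tables = {"sales.orders": {"columns": ["order_id", "amount"]}}
hdfs_files = {"/data/raw/clicks": {"format": "gzip"}}

catalog = FederatedCatalog()
catalog.register_source("hive", hive_tables.get)
catalog.register_source("hdfs", hdfs_files.get)
catalog.register_asset("sales.orders", "hive")
catalog.register_asset("/data/raw/clicks", "hdfs")
```

Because the catalog holds references rather than copies, a change made in the source system is visible on the next `lookup` without any synchronization job. A centralized repository would instead copy the definitions in and have to keep them in sync.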
Use-cases for the need for Metadata
Use Cases – Analytics
1. Finding the Data: Data scientists spend a lot of time finding the correct columns for variable selection
• Around 80% of the data scientist’s time goes to column investigation with SMEs
2. Profile of Data: Reduce the amount of time spent on data profiling through ad-hoc queries
• ~78% of the queries run on the cluster are profiling queries
3. Track the transformation: Data scientists would like to understand how the data sets are derived
• Not fully tracked except at a high level
Source: Aetna
1. Finding the data: Challenges
• Hive requires relatively manual traversal of the schema to find the table and columns
• HDFS also requires traversal of the directory listing to find a file
• Any documentation (external to the system) becomes outdated and is not always reliable
• No simple way to add business metadata
Source: Aetna
HDFS/Hive Architecture
Source: hadoop.apache.org; Ben Lever - SlideShare
1. Finding the data: Solutions
• Run-time capture of Hive and HDFS metadata, stored in a repository
• Provide an API to query the metadata and search across it
• Provide an API or other ways to enrich the data with its business context
Diagram labels: Business Metadata; Technical/Physical Metadata; Hive; HDFS; Ingestion/Sqoop; Apache Atlas
Source: Aetna
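The capture, search and enrich steps above can be sketched as a tiny in-memory index. This is an illustration only, assuming a hypothetical `MetadataIndex` class and made-up asset names. It is not the Apache Atlas API, which exposes its own REST endpoints for these operations.

```python
from typing import Dict, List


class MetadataIndex:
    """Toy repository: technical metadata is captured, business context is added later."""

    def __init__(self) -> None:
        self._entries: Dict[str, Dict[str, str]] = {}

    def capture(self, asset: str, **technical: str) -> None:
        # Called by an ingestion hook whenever a table or file is created.
        self._entries.setdefault(asset, {}).update(technical)

    def enrich(self, asset: str, **business: str) -> None:
        # Business context can only be attached to an asset that was captured.
        if asset not in self._entries:
            raise KeyError(f"unknown asset: {asset}")
        self._entries[asset].update(business)

    def search(self, text: str) -> List[str]:
        # Case-insensitive substring match over asset names and attribute values.
        needle = text.lower()
        hits = []
        for asset, attrs in self._entries.items():
            haystack = " ".join([asset, *attrs.values()]).lower()
            if needle in haystack:
                hits.append(asset)
        return sorted(hits)


# Hypothetical assets, for illustration only.
index = MetadataIndex()
index.capture("hive:claims.member", column="mbr_id", data_type="string")
index.capture("hdfs:/landing/claims/2016", file_format="gzip")
index.enrich("hive:claims.member", description="Unique member identifier")
```

With business descriptions in the same index as table and file names, a search for "identifier" finds the `mbr_id` column even though the physical name gives no hint of its meaning.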
2. Profile of Data: Challenges and Solutions
Challenges:
• Access to the Hive metastore will introduce latency in production
• Lack of comprehensive information provided by the Hive metastore
Chart – Average Daily Query: Profiling 78%, Exploratory 18%, Production 4%
Solutions:
• Provide a system with business and technical data that are cross-referenced
• Have a framework for the data scientist to accommodate additional profiling
Source: Aetna
3. Track the transformation: Challenges and Solutions
Challenges:
• Documenting transformations is manual and difficult to scale
• A mechanism for auditing the data pipeline is still lacking
• Data quality and provenance tracking are too manual
Solutions:
• Leverage metadata already captured to construct transformations
• Provide an API to query transformations
• Provide a visualization for the transformations
Source: Aetna
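As a rough illustration of querying transformations from captured metadata, the sketch below records each transformation as an edge (source, operation, target) and walks the edges backwards to answer "how was this data set derived?". The `LineageGraph` class and the data set names are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source, operation, target)


class LineageGraph:
    def __init__(self) -> None:
        # target data set -> list of (source data set, operation)
        self._parents: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def record(self, source: str, target: str, operation: str) -> None:
        self._parents[target].append((source, operation))

    def upstream(self, dataset: str) -> List[Edge]:
        """Breadth-first walk from a data set back to its original sources."""
        edges: List[Edge] = []
        seen = {dataset}
        queue = deque([dataset])
        while queue:
            current = queue.popleft()
            for source, operation in self._parents.get(current, []):
                edges.append((source, operation, current))
                if source not in seen:
                    seen.add(source)
                    queue.append(source)
        return edges


# Hypothetical pipeline, for illustration only.
lineage = LineageGraph()
lineage.record("crm.customers", "staging.customers", "sqoop import")
lineage.record("staging.customers", "warehouse.dim_customer", "hive insert")
lineage.record("staging.orders", "warehouse.fact_sales", "hive insert")
lineage.record("warehouse.dim_customer", "warehouse.fact_sales", "hive join")

derivation = lineage.upstream("warehouse.fact_sales")
```

The returned edge list is also what a lineage visualization would draw: each tuple is one arrow, labelled with the operation.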
What do we need?
1. A searchable platform for business and technical metadata across all the data types
2. Data profile store with basic metrics of the data
• Min
• Max
• Column distribution
3. Visual lineage for the data flow from the source system to the different components within the platform
• ETL operations – high-level view
• Analytics queries
4. Automated, metadata-driven data ingestion and management
• The data lake concept relies on capturing a robust set of attributes for every piece of content within the lake
• Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility
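The data profile store in item 2 could hold, per column, something like the record computed below. This is a minimal sketch assuming small in-memory columns. The function name and the extra `count` and `nulls` fields are illustrative additions, and a real platform would compute these metrics with the cluster's query engine.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def profile_column(values: Iterable[Any]) -> Dict[str, Any]:
    """Compute basic profile metrics: min, max and value distribution."""
    values = list(values)
    present = [v for v in values if v is not None]   # ignore missing values
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "distribution": dict(Counter(present)),
    }


# Hypothetical column values, for illustration only.
ages = [34, 51, None, 34, 27]
profile = profile_column(ages)
```

Storing this result once, at ingestion time, is what lets later users read the profile instead of each running their own ad-hoc profiling query.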
Solutions for Hadoop
Apache Atlas – deep dive
• Apache Atlas Capabilities: Overview
• Data Classification
– Import or define business-oriented taxonomy annotations for data
– Define, annotate, and automate capture of relationships between data sets
– Export metadata to third-party systems
• Centralized Auditing
– Capture security access information
– Capture the operational information for execution, steps, and activities
• Search & Lineage (Browse)
– Text-based search locates relevant data and audit events across the data lake quickly and accurately
– Browse visualization of data set lineage, allowing users to drill down into operational, security, and provenance-related information
• Security & Policy Engine
– Rationalize compliance policy at runtime based on data classification schemes
Source: Hortonworks
Open-source Apache Incubator project
Demo
Apache Atlas in action!
Possible solutions for other platforms
Netflix – Managing Data Platforms
Source: Netflix
Possible Solutions for other Platforms
Metacat
• Apply metadata management at the service layer
• Federated metadata catalog for the whole data platform
• Proxy service to different metadata sources
• Data metrics, data usage, ownership, categorization and retention policy …
• Common interface for tools to interact with metadata
Tracking Data Difference
• Apply metadata management at the service layer
• Track the changes to documents/entities
• Custom code tracking through logs collected in MongoDB, or use a module called Mongoid
Netflix OSS
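Tracking changes to documents/entities comes down to comparing two versions of a record and logging what differs. The sketch below shows that idea on plain dictionaries. It is a generic illustration with hypothetical names and data, not how Metacat or any MongoDB module implements it.

```python
from typing import Any, Dict, Tuple

Change = Tuple[str, Any, Any]  # (kind, before, after)


def diff_documents(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Change]:
    """Return the fields that were added, removed or changed between versions."""
    changes: Dict[str, Change] = {}
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes[key] = ("added", None, new[key])
        elif key not in new:
            changes[key] = ("removed", old[key], None)
        elif old[key] != new[key]:
            changes[key] = ("changed", old[key], new[key])
    return changes


# Hypothetical document versions, for illustration only.
v1 = {"title": "Q1 report", "owner": "finance", "status": "draft"}
v2 = {"title": "Q1 report", "owner": "finance-ops", "reviewed_by": "audit"}

change_log = diff_documents(v1, v2)
```

Appending each such change log, with a timestamp and user, to a collection gives an audit trail of how an entity evolved.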
Where Else?
{
  "Description": "A containerized foobar",
  "Usage": "docker run --rm example/foobar [args]",
  "License": "GPL",
  "Version": "0.0.1-beta",
  "aBoolean": true,
  "aNumber": 0.01234,
  "aNestedArray": ["a", "b", "c"]
}

<meta name="description" content="155 characters of message matching text with a call to action goes here">

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.JOSA.Meta</groupId>
  <artifactId>project</artifactId>
  <version>1.0</version>
</project>
Notes - Summary
• Consider the different types of metadata you need to manage
• Build a robust descriptive dictionary for the data
• Manage metadata as a team effort. It has a lot of benefits, so keep it agile but effective.
Finally…remember that
One’s Metadata – d/dx – is someone else’s Data!
Resources
• HDP 2.3 Preview Sandbox VM: (Hortonworks)
– http://hortonworks.com/hdp/whats-new/
• Apache Atlas:
– http://atlas.incubator.apache.org/
– http://incubator.apache.org/projects/atlas.html
– https://git-wip-us.apache.org/repos/asf/incubator-atlas.gi
• Metadata Management (General)
– https://www.informatica.com/content/dam/informatica-com/global/amer/us/collateral/white-paper/metadata-management-data-governance_white-paper_2163.pdf
Questions..?
Contact info:
Tariq Ezzibdeh
tariqzibdeh@gmail.com

JOSA TechTalk: Metadata Management
in Big Data

  • 1. Metadata Management in Big Data Data Management Challenges @ezzibdeh Tariq Ezzibdeh
  • 2. Aim • Outline some perspective on metadata management principles that apply in the big data space and beyond • Provide some data governance foundations in the data space that essentially would outlast the actual technologies to serve the needs of the future • Discuss technologies and solutions currently in the market
  • 3. Big Data - Overview • Big Data 5 V’s: • Volume • Velocity • Variety • Veracity • Platform of today – set of relatively split up components • Data is stored on HDFS – File system • Catalogue of Data and its schema is maintained in another service – TBC! • Query front ends – Query engines based on different requirements => Value
  • 4. Platform Architecture – Modern Architecture DataSources Acquisition DataSystems Staging Zone ETL and data standardization Pristine Archive Compressed Gzip etc. Data Warehouse Immutable data Analytics Zone Allocated data changes Schema Catalogue Well-define reference to data structures and attributes Data Ledger Track data and its access with lineage and operations BigDataPlatform Data Marts UI/API Apps Source: Hortonworks
  • 5. Why do we need to manage metadata for Big Data platforms? • Large volumes of data landing in Hadoop/Big Data • A growing number of users working with the data • The need for effective control & consumption of data The implementation needs to: • Offer good data visibility across your cluster • Capture data lineage across source systems and in the platform • Audit and record operations that are performed in the platform • Enforce policies that are defined by the platform stewards • Help reduce data redundancy on the platform Source: Cloudera
  • 7. Metadata – What is it? Data about Data! • Business Metadata: supplies the business context around data, such as the business term’s name, definition, owners or stewards, and associated reference data • Technical Metadata: provides technical information about the data, such as the name of the source table, the source table column name, and the data type (e.g., string, integer) • Operational Metadata: furnishes information about the use of the data, such as date last updated, number of times accessed, or date last accessed Source: Informatica
  • 8. Why do I need all this metadata? • Data lake will contain all types of data – log streams (Kafka), databases (Sqoop)… don’t let your lake turn into a swamp! • Consistency of definitions – to reconcile differences in terminology such as "clients" and "customers," "revenue" and "sales" • Clarity of data lineage – the origins of a data set, which can be granular enough to define information at the attribute level, including operations on it • To understand data usage on your cluster • Optimize queries and views • Compliance and Regulatory • Compliance – capture, store and move data – Sarbanes-Oxley, HIPAA, Basel II • Security – authorization, authentication – handling sensitive data • Auditing – recording every attempt to access • Archive & Retention – data life cycle policies Source: Teradata/TechTarget
  • 9. Metadata System Architecture Topologically, a metadata repository architecture follows one of three styles: • Centralized metadata repository – Pros: efficient access, adaptability, scalability and high performance – Cons: single point of failure and continuous synchronization • Distributed metadata repository – Pros: real-time access to the metadata repository, up-to-date metadata – Cons: overhead in keeping the configuration current as source systems change, and in providing HA • Federated or hybrid metadata repository – central definition storage with references to the proper locations of the accurate definitions Source: TechTarget
  • 10. Use-cases for the need for Metadata
  • 11. Use Cases – Analytics 1. Finding the Data: Data scientists spend a lot of time finding the correct columns for variable selection • Around 80% of the data scientist’s time is spent on column investigation with SMEs 2. Profile of Data: Reduce the amount of time spent on data profiling through ad-hoc queries • ~78% of the queries run on the cluster are profiling queries 3. Track the transformation: Data scientists would like to understand how the data sets are derived • Not fully tracked except at a high level Source: Aetna
  • 12. 1. Finding the data: Challenges • Hive requires relatively manual traversal of the schema to find the table and columns • HDFS also requires traversal of the directory listing to find a file • Any documentation (external to the system) becomes outdated and is not always reliable • No simple way to add business metadata Source: Aetna
  • 14. 1. Finding the data: Solutions • Capture Hive and HDFS metadata at run time and store it in a repository • Provide an API to query the metadata and search across it • Provide an API or other ways to enrich the data with its business context Diagram labels: Business Metadata, Technical/Physical Metadata, Hive, HDFS, Ingestion/Sqoop, Apache Atlas Source: Aetna
  • 15. 2. Profile of Data: Challenges and Solutions Challenges: • Access to the Hive metastore will introduce latency in production • Lack of comprehensive information provided by the Hive metastore Chart (Average Daily Query): Profiling 78%, Exploratory 18%, Production 4% Solutions: • Provide a system with business and technical data that are cross-referenced • Have a framework for the data scientist to accommodate additional profiling Source: Aetna
  • 16. 3. Track the transformation: Challenges and Solutions Challenges: • Documenting transformations is manual and difficult to scale • A mechanism for auditing the data pipeline is still lacking • Data quality and provenance tracking are too manual Solutions: • Leverage metadata already captured to construct transformations • Provide an API to query transformations • Provide a visualization for the transformations Source: Aetna
  • 17. What do we need? 1. A searchable platform for all the data types, covering business and technical metadata 2. Data profile store with basic metrics of the data • Min • Max • Column distribution 3. Visual lineage for the data flow from the source system to different components within the platform • ETL operations – high-level view • Analytics queries 4. Automated, metadata-driven data ingestion and thus management • The Data Lake concept relies on capturing a robust set of attributes for every piece of content within the lake • Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility.
  • 19. Apache Atlas – deep dive • Apache Atlas Capabilities: Overview • Data Classification • Import or define taxonomy business-oriented annotations for data • Define, annotate, and automate capture of relationships between data sets • Export metadata to third-party systems • Centralized Auditing • Capture security access information • Capture the operational information for execution, steps, and activities • Search & Lineage (Browse) • Text-based search features locate relevant data and audit events across the Data Lake quickly and accurately • Browse visualization of data set lineage, allowing users to drill down into operational, security, and provenance-related information • Security & Policy Engine • Rationalize compliance policy at runtime based on data classification schemes Source: Hortonworks. Open-source Incubator project
  • 21. Possible solutions for other platforms
  • 22. Netflix – Managing Data Platforms Source: Netflix
  • 23. Possible Solutions for other Platforms Metacat: • Apply metadata management on the service layer • Federated metadata catalog for the whole data platform • Proxy service to different metadata sources • Data metrics, data usage, ownership, categorization and retention policy … • Common interface for tools to interact with metadata Tracking Data Difference: • Apply metadata management on the service layer • Track the changes to documents/entities • Custom code tracking through logs collected as Mongo, or use a module called MongoID Netflix OSS
  • 24. Where Else? { "Description": "A containerized foobar", "Usage": "docker run --rm example/foobar [args]", "License": "GPL", "Version": "0.0.1-beta", "aBoolean": true, "aNumber" : 0.01234, "aNestedArray": ["a", "b", "c"] } <meta name="description" content="155 characters of message matching text with a call to action goes here"> <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> <modelVersion>4.0.0</modelVersion> <groupId>com.JOSA.Meta</groupId> <artifactId>project</artifactId> <version>1.0</version> </project>
  • 25. Notes - Summary • Consider the different types of metadata you need to manage • Build a robust descriptive dictionary for the data • Manage metadata as a team effort. It has many benefits, so make it agile but effective. Finally… remember that one’s Metadata – d/dx – is someone else’s Data!
  • 26. Resources • HDP 2.3 Preview Sandbox VM (Hortonworks): – http://hortonworks.com/hdp/whats-new/ • Apache Atlas: – http://atlas.incubator.apache.org/ – http://incubator.apache.org/projects/atlas.html – https://git-wip-us.apache.org/repos/asf/incubator-atlas.gi • Metadata Management (General): – https://www.informatica.com/content/dam/informatica-com/global/amer/us/collateral/white-paper/metadata-management-data-governance_white-paper_2163.pdf
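
Slides 7 and 14 describe a repository that holds business, technical, and operational metadata and exposes an API to search and enrich it. A minimal Python sketch of that idea follows. The `DatasetMetadata` and `MetadataCatalog` names, their fields, and the sample tables are illustrative assumptions; this is not the Apache Atlas API.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical record combining the three metadata types from slide 7."""
    name: str                                            # technical: table or file name
    columns: dict = field(default_factory=dict)          # technical: column -> data type
    business_terms: list = field(default_factory=list)   # business: glossary terms
    owner: str = ""                                      # business: owner or steward
    access_count: int = 0                                # operational: times accessed


class MetadataCatalog:
    """Hypothetical in-memory repository with a simple search API."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def enrich(self, name, term):
        """Attach a business term to an already registered data set."""
        self._entries[name].business_terms.append(term)

    def search(self, keyword):
        """Return names of data sets whose name, columns, or terms match."""
        needle = keyword.lower()
        hits = []
        for entry in self._entries.values():
            haystack = [entry.name, *entry.columns, *entry.business_terms]
            if any(needle in text.lower() for text in haystack):
                hits.append(entry.name)
        return sorted(hits)


catalog = MetadataCatalog()
catalog.register(DatasetMetadata("sales_orders", {"client_id": "string", "amount": "double"}))
catalog.register(DatasetMetadata("web_logs", {"url": "string", "ts": "timestamp"}))
catalog.enrich("sales_orders", "Customer revenue")

print(catalog.search("customer"))  # ['sales_orders'], found via the business term
print(catalog.search("url"))       # ['web_logs'], found via a column name
```

Searching on "customer" finds a table whose columns only say "client", which is the kind of terminology reconciliation slide 8 asks for.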

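
Slide 17 asks for a data profile store holding basic metrics (min, max, column distribution). One way such a profile could be computed is sketched below in Python. The `profile_column` helper, the `profile_store` dictionary, and the sample values are hypothetical; a real platform would compute these over Hive tables rather than in-memory lists.

```python
from collections import Counter


def profile_column(values):
    """Compute min, max, null count, and value distribution for one column."""
    present = [v for v in values if v is not None]
    if not present:
        return {"min": None, "max": None, "nulls": len(values), "distribution": {}}
    return {
        "min": min(present),
        "max": max(present),
        "nulls": len(values) - len(present),
        "distribution": dict(Counter(present)),
    }


# Hypothetical profile store: (table, column) -> precomputed profile.
profile_store = {}
profile_store[("sales_orders", "amount")] = profile_column([10, 25, 10, None, 40])

print(profile_store[("sales_orders", "amount")])
# {'min': 10, 'max': 40, 'nulls': 1, 'distribution': {10: 2, 25: 1, 40: 1}}
```

Storing profiles once lets users look them up instead of re-running profiling queries, which slide 11 reports as roughly 78% of the queries on the cluster.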
Editor's Notes

  1. How many of you use Hadoop? And in production?
  2. Value and governance itself; predictive powers like entity resolution, etc.; value related to cluster health; regulatory
  3. Find sources
  4. For the platform of today and what comes after Hadoop, I guess