A major part of ingesting and processing data is understanding how to bring it together. Today, over 70% of the effort in data analytics goes into cleaning and parsing data so that value can be derived from it. This is not trivial given the large volumes of data we deal with. This talk covers how to set up data governance principles that everyone can adhere to, in order to create a better and quicker analytics system.
Data Science Salon 2018 - Building a true enterprise data governance platform for the modern machine learning era by Subash D'Souza
1. BUILDING A TRUE ENTERPRISE DATA GOVERNANCE PLATFORM FOR THE MODERN MACHINE LEARNING ERA
Subash D'Souza
Director Big Data & Ops, Warner Bros – Data Intelligence Group
2. BIO
Organizer - Data Con LA, formerly known as Big Data Day LA
Organizer - LA Big Data Users Group
Organizer - LA Apache Spark Users Group
Advisor - Anthill Studio
Investor – FOSSA
Data Governance PMC Member – The Linux Foundation – ODPi
Founder – Archangel Technology Consultants LLC
4. ENTERPRISE DATA GOVERNANCE CHALLENGES
• Poor data quality costs real money
• Process efficiency is negatively impacted by poor data governance
• The full potential benefits of new systems are not realized because of poor data governance
• Decision making is negatively affected by poor data governance
5. ENTERPRISE DATA GOVERNANCE OBJECTIVES
Guide information management decision-making
Ensure information is consistently defined and well understood
Increase the use and trust of data as an organizational asset
Improve consistency of projects across the organization
Ensure regulatory compliance
Eliminate data risks
6. MASTER DATA MANAGEMENT PROBLEMS
Discovery - cannot find the right information
Integration - cannot manipulate and combine information
Dissemination - cannot consume information
Insight - cannot extract value and knowledge from information
Management - cannot manage and control information volumes and growth
7. MASTER DATA MANAGEMENT ISSUES
52% of users don’t have confidence in their information
59% of managers miss information they should have used
42% of managers use wrong information at least once a week
75% of CIOs believe they can strengthen their competitive advantage by better using and managing enterprise data
78% of CIOs want to improve the way they use and manage their data
Only 15% of CIOs believe that their data is currently comprehensively well managed
8. METADATA MANAGEMENT
Metadata is structured information that describes, explains, locates, or otherwise makes it easier to retrieve, use, or manage an information resource. Metadata is often called data about data or information about information.
Metadata needs to be managed to ensure:
• Availability: metadata needs to be stored where it can be accessed and indexed so it can be found.
• Quality: metadata needs to be of consistent quality so users know that it can be trusted.
• Persistence: metadata needs to be kept over time.
• Open License: metadata should be available under a public domain license to enable its reuse.
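The four requirements above (availability, quality, persistence, open license) can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the talk: the `MetadataRecord` and `MetadataStore` names, the field set, and the in-memory tag index are assumptions chosen only to show how a store might reject incomplete metadata and keep it findable.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Set


@dataclass
class MetadataRecord:
    """Hypothetical metadata entry describing one information resource."""
    resource_id: str
    description: str
    location: str                     # where the resource can be retrieved
    license: str = "public-domain"    # assumption: open license by default
    tags: List[str] = field(default_factory=list)
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def quality_issues(self) -> List[str]:
        """Return the names of required fields that are empty."""
        required = {
            "resource_id": self.resource_id,
            "description": self.description,
            "location": self.location,
        }
        return [name for name, value in required.items() if not value.strip()]


class MetadataStore:
    """In-memory store with a tag index (illustrative only, not durable)."""

    def __init__(self) -> None:
        self._records: Dict[str, MetadataRecord] = {}
        self._by_tag: Dict[str, Set[str]] = {}

    def add(self, record: MetadataRecord) -> None:
        # Quality: refuse records with missing required fields.
        issues = record.quality_issues()
        if issues:
            raise ValueError(f"incomplete metadata: {issues}")
        self._records[record.resource_id] = record
        # Availability: index by lower-cased tag so records can be found.
        for tag in record.tags:
            self._by_tag.setdefault(tag.lower(), set()).add(record.resource_id)

    def find_by_tag(self, tag: str) -> List[MetadataRecord]:
        ids = sorted(self._by_tag.get(tag.lower(), set()))
        return [self._records[i] for i in ids]
```

A real platform would back this with a durable, searchable catalog; the sketch only shows where each requirement would be enforced.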
9. METADATA MANAGEMENT LIFECYCLE
The metadata lifecycle is larger than the data lifecycle:
• Metadata may be created before data is created or captured, e.g. to inform about data that will be available in the future.
• Metadata needs to be kept after data has been removed, e.g. to inform about data that has been decommissioned or withdrawn.
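One way to picture a metadata lifecycle that outlives the data is a catalog in which the metadata entry exists before the data arrives and survives its removal. The sketch below is an assumption for illustration: the `Catalog` class, the three state names, and the method names are hypothetical and do not come from the talk.

```python
from enum import Enum
from typing import Dict


class DataState(Enum):
    PLANNED = "planned"        # metadata exists, data not yet captured
    AVAILABLE = "available"    # data exists and can be used
    WITHDRAWN = "withdrawn"    # data removed, metadata kept


class Catalog:
    """Keeps metadata and data separately so their lifecycles can differ."""

    def __init__(self) -> None:
        self._metadata: Dict[str, dict] = {}
        self._data: Dict[str, bytes] = {}

    def announce(self, name: str, description: str) -> None:
        """Create metadata for data that will be available in the future."""
        self._metadata[name] = {
            "description": description,
            "state": DataState.PLANNED,
        }

    def publish(self, name: str, payload: bytes) -> None:
        """Attach data to a previously announced entry."""
        if name not in self._metadata:
            raise KeyError(f"announce {name!r} before publishing data")
        self._data[name] = payload
        self._metadata[name]["state"] = DataState.AVAILABLE

    def withdraw(self, name: str, reason: str) -> None:
        """Remove the data but keep the metadata, recording why."""
        self._data.pop(name, None)
        self._metadata[name]["state"] = DataState.WITHDRAWN
        self._metadata[name]["withdrawal_reason"] = reason

    def describe(self, name: str) -> dict:
        return dict(self._metadata[name])

    def has_data(self, name: str) -> bool:
        return name in self._data
```

After `withdraw`, `describe` still answers questions about the dataset even though `has_data` returns `False`, which is the behavior the two bullets above call for.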
14. DATA GOVERNANCE PROJECT / SERVICE SCOPE
DATA SECURITY & COMPLIANCE (PII, GDPR, Retention, Oversight, Reporting): Value add service for data security, retention and compliance
DATA CATALOG (Data Discovery, Classification, AI, ML): Crawl, introspect, discover and classify data
DATA QUALITY (Data Cleansing, Data Completeness): Clean, correct, validate and triage data
DATA LINEAGE (Attribution, Survivorship, Feedback Loops): Track and record data source, path and life cycle
METADATA MANAGEMENT (Data Tagging, Federated Search): Manage searchable properties associated with data
MDM (MASTER DATA MANAGEMENT) (Taxonomy, Single Version of the Truth): Service for the ingestion, cleansing and curation of data
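As a rough illustration of the catalog's "discover and classify" step applied to PII, the sketch below labels a column when most of its values match a pattern. The patterns, the 0.8 threshold, and the function names are assumptions made for this example; the talk does not specify how classification is done, and a production classifier would need far more robust detection.

```python
import re
from typing import Dict, Iterable, List

# Deliberately simple, hypothetical patterns; real PII detection needs more.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def classify_column(values: Iterable[str], threshold: float = 0.8) -> List[str]:
    """Return the PII labels whose pattern matches at least `threshold`
    of the non-empty values in the column."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return []
    labels = []
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in non_empty if pattern.fullmatch(v))
        if hits / len(non_empty) >= threshold:
            labels.append(label)
    return labels


def classify_table(columns: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Classify every column of a table given as {column name: values}."""
    return {name: classify_column(vals) for name, vals in columns.items()}
```

Labels produced this way could then drive the security and compliance services, for example by flagging tagged columns for encryption or anonymization.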
15. BUSINESS VALUE ADD
Revenue, Agility, Cost, Compliance
• Reduce risk
• Control access to data
• Adhere to government and corporate regulations
• Manage customer privacy preferences
• Automate manual business processes
• Reduce data errors
• Eliminate duplication of efforts and technologies
• IT system consolidation initiatives
• Recognize new business opportunities
• Increase business reach by identifying synergy
• Improve marketing capabilities
• Faster time to market
• Consolidate data from silos
• Meet demands of new business channels
• Grow with the business
• Identify key relationships and hierarchies
16. PROPOSED SOLUTION
Metadata Management
Data Lineage
Data Catalog
Data Security
Data Quality
Master Data Management
Rules Engine
Workflow & Audit
Applications
Data
SERVICE ORIENTED ARCHITECTURE
Communicating Truth to Others
SOA Patterns:
● Service Provider
● Service Broker, Registry, Repository
● Service Requester / Recipient
● Incoming & outgoing Services
Defining Concepts:
● Business Value
● Strategic Goals
● Intrinsic Interoperability
● Shared Services
● Flexibility
● Evolutionary Refinement
Orchestrated Service Types:
● Push & Pull
● RESTful Services
● SOAP Services
● Database Connectivity
● Flat File Support
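The provider, registry and requester roles listed under SOA Patterns can be sketched in a few lines. This is an in-process illustration only: the `ServiceRegistry` class, the `mdm.normalize_title` service name, and the dict-in, dict-out calling convention are hypothetical, and a real deployment would expose these over REST, SOAP or another transport from the list above.

```python
from typing import Callable, Dict

# Assumption: a service takes a request dict and returns a response dict.
Service = Callable[[dict], dict]


class ServiceRegistry:
    """Plays the broker/registry role: maps service names to providers."""

    def __init__(self) -> None:
        self._services: Dict[str, Service] = {}

    def register(self, name: str, handler: Service) -> None:
        if name in self._services:
            raise ValueError(f"service {name!r} is already registered")
        self._services[name] = handler

    def lookup(self, name: str) -> Service:
        try:
            return self._services[name]
        except KeyError:
            raise LookupError(f"no provider registered for {name!r}") from None


def normalize_title(request: dict) -> dict:
    """Hypothetical provider: trims and title-cases a 'title' field."""
    return {"title": request["title"].strip().title()}


def call_service(registry: ServiceRegistry, name: str, request: dict) -> dict:
    """Requester side: resolve the provider through the registry, then call."""
    return registry.lookup(name)(request)
```

Because requesters only know the service name, a provider can be replaced or refined without changing its consumers, which is the kind of flexibility and evolutionary refinement the slide lists.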
17. DETAILED SERVICE OFFERING ARCHITECTURE
[Architecture diagram; labels as extracted from the slide, grouped by apparent component]
Foundation layers: Rules Engine; SOA (Service Oriented Architecture); Auditing; Business Process Orchestration; Infrastructure (Network, Storage, Messaging, Transport)
Data Catalog: Data Classify, Data Discovery
Data Quality: Quality Cleansing
Data Lineage: Data Source, Workflow, Feedback Loop
Metadata Mgmt.: Mgmt. Strategy, Metadata Storage, Metadata Capture
Master Data Mgmt.: Data Architecture, Data Quality, Taxonomy
Data Security: Compliance, Encryption, Obfuscate / Anonymize, Access Control
Distribution: FTP, HTTP, TCP / UDP, DATABASE
Consumers and interfaces: B2B / Internal Users, B2C / External Users, Web Services, Search, Internal Up Stream Systems, External Down Stream Systems, User Interfaces
Other labels: Tag, FILE, PII / GDPR, Service Registries, Messaging, Document, Logical Ordering, Publish, Complete, Accuracy, Consistency, Metadata Integration, Business Intelligence, Distribution & Feedback, Processing, Database
18. MDM USE CASE – DATA CLEANSING
[Flow diagram; labels as extracted from the slide, grouped by apparent stage]
Upstream / Source Systems: Source System 1, Source System 2, Source System N
Target systems: Target System 1, Target System N
MDM Components: Data Staging, ETL, Microservices, Data Merge (optional), MDM Portal (optional)
Automated data cleansing / triage: technical, logical and structural cleansing, followed by the check "Is data clean?"
Manual data cleansing / triage: Data Support Staff and Technical Support Staff handle unresolved records, followed by the check "Is data clean & resolved?"
MDM Master Data Curation: MDM Curators review, curate & approve; the result is the Clean MDM Master
MDM Data Exploitation (adding business value): MDM Users, DB Exports (as needed), DB ETL Feed (as needed)
Return clean data back to source systems via SOA services
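The automated-then-manual triage flow on this slide can be sketched as a function that applies cleansing rules and routes whatever is still not clean to a manual queue. The two rules, the `is_clean` check, and the record layout are assumptions made for illustration; the talk does not define the actual rules or data model.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Rule = Callable[[Record], Record]


def strip_whitespace(record: Record) -> Record:
    """Hypothetical technical rule: trim every field."""
    return {key: value.strip() for key, value in record.items()}


def normalize_country(record: Record) -> Record:
    """Hypothetical logical rule: map known aliases to a 2-letter code."""
    aliases = {"usa": "US", "united states": "US", "u.s.": "US"}
    cleaned = dict(record)
    country = cleaned.get("country", "")
    cleaned["country"] = aliases.get(country.lower(), country)
    return cleaned


def is_clean(record: Record) -> bool:
    """Assumed definition of clean: non-empty id and an upper-case
    2-letter country code."""
    country = record.get("country", "")
    return bool(record.get("id")) and len(country) == 2 and country.isupper()


def triage(records: List[Record], rules: List[Rule]) -> Tuple[List[Record], List[Record]]:
    """Apply automated rules to dirty records; return (resolved, unresolved).
    Unresolved records are the ones that would go to manual triage."""
    resolved: List[Record] = []
    unresolved: List[Record] = []
    for record in records:
        if not is_clean(record):
            for rule in rules:
                record = rule(record)
        (resolved if is_clean(record) else unresolved).append(record)
    return resolved, unresolved
```

In the flow on the slide, the resolved list would feed the clean master, while the unresolved list would be handed to data and technical support staff before being checked again.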