This document summarizes a presentation by Joe Caserta on defining and applying data governance in today's business environment. It discusses why data governance matters for big data, the challenges of governing big data due to its volume, variety, velocity, and veracity, and recommendations for establishing a big data governance framework, covering metadata, information lifecycle management, master data management, data quality monitoring, and security.
Defining and Applying Data Governance in Today’s Business Environment
1. Defining and Applying Data Governance in Today's Business Environment
Joe Caserta, President, Caserta Concepts
December 8-12, 2014
The Westin Beach Resort, Ft. Lauderdale, Florida
2. Joe Caserta Timeline
• 1986 – Began consulting: database programming and data modeling; 25+ years hands-on experience building database solutions
• 1996 – Dedicated to Data Warehousing and Business Intelligence
• 2001 – Founded Caserta Concepts in NYC; web log analytics solution published in Intelligent Enterprise
• 2004 – Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley)
• 2009 – Launched Big Data practice
• 2010 – Partnered with Big Data vendors: Cloudera, Hortonworks, IBM, Cisco, Datameer, Basho, more; launched Training practice, teaching data concepts world-wide
• 2012 – Launched Big Data Warehousing (BDW) Meetup in NYC, ~1,500 members
• 2013 – Formalized alliances/partnerships with system integrators; laser focus on extending Data Warehouses with Big Data solutions; established best practices for big data ecosystem implementation in Healthcare, Finance, and Insurance
• 2014 – Dedicated to Data Governance techniques on Big Data (innovation); named among the Top 20 Most Powerful Big Data consulting firms (CIO Review)
3. Today's business environment requires Big Data
[Architecture diagram: source systems (Enrollments, Claims, Finance, others) feed two environments. Traditional BI: ETL into a traditional EDW with ad-hoc and canned reporting. Big Data: ETL into a horizontally scalable Big Data cluster optimized for analytics – HDFS across nodes N1-N5, MapReduce, Pig/Hive, Mahout – plus NoSQL databases, supporting canned reporting, ad-hoc query, big data analytics, and data science.]
5. Why is Big Data Governance Important?
• It is the convergence of data quality, management, and policies across all the data in an organization.
• It is a set of processes that ensures important data assets are formally managed throughout the enterprise.
• It ensures data can be trusted, and makes people accountable for low data quality.
• It is about putting people and technology in place to fix and prevent issues with data so that the enterprise can become more efficient.
6. The Challenges With Governing Big Data
• Velocity: Data is coming in so fast, how do we monitor it? Real real-time analytics.
• Veracity: What does "complete" mean? Dealing with sparse, incomplete, volatile, and highly manufactured data. How do you certify sentiment analysis?
• Variety: A wider breadth of datasets and sources in scope requires larger data governance support. Data governance cannot start at the data warehouse.
• Volume: Data volume is higher, so the process must be more reliant on programmatic administration, with less people/process dependence.
7. What's Old is New Again
Before Data Warehousing and Data Governance:
• Users trying to produce reports from raw source data
• No data conformance
• No Master Data Management
• No data quality processes
• No trust: two analysts were almost guaranteed to come up with two different sets of numbers!
Before Big Data Governance:
• We can put "anything" in Hadoop
• We can analyze anything
• We're scientists, we don't need IT, we make the rules
Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess.
Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable.
8. Making it Right
The promise is an "agile" data culture where communities of users are encouraged to explore new datasets in new ways:
• New tools
• External data
• Data blending
• Decentralization
With all the V's, data scientists, new tools, and new data, we must rely LESS on HUMANS:
• We need more systematic administration
• We need systems and tools to help with big data governance
• This space is EXTREMELY immature!
Steps towards Big Data Governance:
1. Establish the difference between traditional data governance and big data governance
2. Establish basic rules for where new data governance can be applied
3. Establish processes for graduating the products of data science to governance
4. Establish a set of tools to make governing Big Data feasible
10. Components of Data Governance
• Organization – the "people" part: establishing an Enterprise Data Council, Data Stewards, etc.
• Metadata – definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security – identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring – data must be complete and correct: measure, improve, certify
• Business Process Integration – policies around data frequency, source availability, etc.
• Master Data Management – ensure consistent business-critical data, i.e. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM) – data retention, purge schedule, storage/archiving
For Big Data, each component stretches:
• Organization – add Big Data to the overall framework and assign responsibility; add data scientists to the stewardship program; assign stewards to new data sets (twitter, call center logs, etc.)
• Metadata – larger scale, new datatypes; integrate with Hive Metastore, HCatalog, home-grown tables
• Privacy/Security – secure and mask multiple data types (not just tabular); data detection and masking on unstructured data upon ingest
• Data Quality and Monitoring – distributed data quality and matching algorithms (probably home-grown; Drools?); quality checks not only in SQL: machine learning, Pig, and MapReduce; acting on large dataset quality checks may require distribution
• Business Process Integration – near-zero latency, DevOps, a core component of business operations
• Master Data Management – graph databases are more flexible than relational; a lower-latency service is required
• Information Lifecycle Management (ILM) – deletes are more uncommon (unless there is a regulatory requirement); take advantage of compression and archiving (like AWS Glacier)
11. Big Data Governance Realities
Full data governance can only be applied to "structured" data: the data must have a known and well documented schema. This can include materialized endpoints such as files or tables, OR projections such as a Hive table.
Governed structured data must have:
• A known schema with metadata
• A known and certified lineage
• A monitored, quality-tested, managed process for ingestion and transformation
• Governed usage – data isn't just for enterprise BI tools anymore
We talk about unstructured data in Hadoop, but more often it's semi-structured or structured with a definable schema. Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed (see the sketch below).
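To make that last point concrete, here is a minimal, hypothetical sketch of structure extraction at ingest: parsing semi-structured JSON log lines into records with a declared schema, rejecting anything that doesn't conform. The field names and schema are invented for illustration.

```python
import json
from datetime import datetime

# Hypothetical declared schema for a click-log feed: field -> parser/validator.
SCHEMA = {
    "user_id": str,
    "url": str,
    "ts": lambda v: datetime.strptime(v, "%Y-%m-%dT%H:%M:%S"),
}

def extract_structure(raw_lines):
    """Yield (record, None) for conforming lines, (None, error) otherwise."""
    for line in raw_lines:
        try:
            obj = json.loads(line)
            record = {field: parse(obj[field]) for field, parse in SCHEMA.items()}
            yield record, None
        except (ValueError, KeyError, TypeError) as exc:
            # Non-conforming data stays in the landing area for later review.
            yield None, f"rejected: {line.strip()!r} ({exc})"

if __name__ == "__main__":
    lines = [
        '{"user_id": "u1", "url": "/home", "ts": "2014-12-08T10:00:00"}',
        '{"user_id": "u2", "url": "/claims"}',  # missing ts -> rejected
    ]
    for rec, err in extract_structure(lines):
        print(rec or err)
```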
12. The Data Scientists Can Help!
Data Science to Big Data Warehouse mapping – full data governance requirements:
• Provide full process lineage
• Data certification process by data stewards and business owners
• Ongoing data quality monitoring that includes quality checks
• Provide requirements for the Data Lake
• Proper metadata established: catalog, data definitions, lineage, quality monitoring
• Know and validate data completeness
13. The Big Data Governance Pyramid
Hadoop has different governance demands at each tier. From bottom to top:
• Landing Area – source data in "full fidelity": raw machine data collection, collect everything. Governance: metadata catalog; ILM (who has access, how long do we "manage" it).
• Data Lake – integrated sandbox. Governance: metadata catalog; ILM; data quality and monitoring (monitoring of completeness of data).
• Data Science Workspace – agile business insight through data munging, machine learning, blending with external data, and development of to-be BDW facts. Governance: metadata catalog; ILM; data quality and monitoring.
• Big Data Warehouse – data is ready to be turned into information: organized, well defined, complete. Fully data governed (trusted); user community runs arbitrary queries and reporting.
Only the top tier of the pyramid is fully governed. We refer to this as the Trusted tier of the Big Data Warehouse.
14. The Information Lifecycle Part of Big Data
Caution: some assembly required. People, processes, and business commitment are still critical!
The V's require robust tooling, but unfortunately the toolset is pretty thin: some of the most hopeful tools are brand new or in incubation! Components like ILM have fair tooling; others like MDM and Data Quality are sparse.
Apache Falcon (incubating) promises many of the features we need, however it is fairly immature (version 0.5).
Recommendation: roll your own custom lifecycle management workflow using Oozie + retention metadata (a sketch follows).
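As one illustration of the "Oozie + retention metadata" idea, here is a minimal, hypothetical purge step: it reads a retention-metadata mapping and deletes expired date-partitioned HDFS paths. An Oozie coordinator would schedule it daily; the mapping layout, partition naming, and use of the `hdfs dfs` CLI (rather than a specific client library) are assumptions.

```python
import subprocess
from datetime import datetime, timedelta

# Hypothetical retention metadata: dataset root path -> retention in days.
RETENTION_DAYS = {
    "/data/landing/clickstream": 30,
    "/data/landing/call_center_logs": 365,
}

def expired_partitions(root, days, today=None):
    """List date-partitioned subdirectories older than the retention window."""
    today = today or datetime.utcnow()
    cutoff = today - timedelta(days=days)
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", root], capture_output=True, text=True, check=True
    ).stdout
    for line in out.splitlines():
        path = line.split()[-1]             # last column of 'hdfs dfs -ls' is the path
        if path.startswith(root + "/dt="):  # assumed layout, e.g. .../dt=2014-01-05
            if datetime.strptime(path.rsplit("dt=", 1)[1], "%Y-%m-%d") < cutoff:
                yield path

def purge():
    for root, days in RETENTION_DAYS.items():
        for path in expired_partitions(root, days):
            print("purging", path)          # keep a log line for the audit trail
            subprocess.run(["hdfs", "dfs", "-rm", "-r", "-skipTrash", path], check=True)

if __name__ == "__main__":
    purge()
```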
15. Master Data Management
Traditional MDM will do, depending on your data size and requirements:
• Relational is awkward: extreme normalization, poor usability and performance
• NoSQL stores like HBase have benefits if you need super-high-performance, low-millisecond response times to incorporate into your Big Data ETL, plus a flexible schema
• A graph database is a near-perfect fit: relationships and graph analysis bring master data to life!
• Data quality and matching processes are required
• Little to no community or vendor support
More will come with YARN (more commercial and open source IP will be leverageable in the Hadoop framework).
Recommendation: Buy + Enhance, or Build.
17. Mastering Customer and Provider Data
Pipeline: Staging Library → Validation → Standardization → Matching → Consolidated Library → Survivorship → Integrated Library (a toy implementation follows).
Staging Library:
| Source | ID  | Name          | Home Address    | Birth Date | SSN         |
|--------|-----|---------------|-----------------|------------|-------------|
| SYS A  | 123 | Jim Stagnitto | 123 Main St     | 8/20/1959  | 123-45-6789 |
| SYS B  | ABC | J. Stagnitto  | 132 Main Street | 8/20/1959  | 123-45-6789 |
| SYS C  | XYZ | James Stag    | NULL            | 8/20/1959  | NULL        |
Consolidated Library (after standardization and matching):
| Source | ID  | Name          | Home Address    | Birth Date | SSN         | Std Name        | Std Addr        | MDM ID |
|--------|-----|---------------|-----------------|------------|-------------|-----------------|-----------------|--------|
| SYS A  | 123 | Jim Stagnitto | 123 Main St     | 8/20/1959  | 123-45-6789 | James Stagnitto | 123 Main Street | 1      |
| SYS B  | ABC | J. Stagnitto  | 132 Main Street | 8/20/1959  | 123-45-6789 | James Stagnitto | 132 Main Street | 1      |
| SYS C  | XYZ | James Stag    | NULL            | 8/20/1959  | NULL        | James Stag      | NULL            | 1      |
Integrated Library (after survivorship):
| MDM ID | Name            | Home Address    | Birth Date | SSN         |
|--------|-----------------|-----------------|------------|-------------|
| 1      | James Stagnitto | 123 Main Street | 8/20/1959  | 123-45-6789 |
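Here is a minimal, hypothetical Python sketch of standardization, matching, and survivorship over the slide's example records. Real implementations would use proper name/address standardization and fuzzy matching; the rules here (nickname dictionary, match on SSN or on birth date plus surname prefix, survive the most complete value) are invented for illustration.

```python
# Hypothetical standardize/match/survive pass over the staging records above.
STAGING = [
    {"source": "SYS A", "id": "123", "name": "Jim Stagnitto", "addr": "123 Main St",
     "dob": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS B", "id": "ABC", "name": "J. Stagnitto", "addr": "132 Main Street",
     "dob": "8/20/1959", "ssn": "123-45-6789"},
    {"source": "SYS C", "id": "XYZ", "name": "James Stag", "addr": None,
     "dob": "8/20/1959", "ssn": None},
]

NICKNAMES = {"Jim": "James", "J.": "James"}  # toy standardization dictionary

def standardize(rec):
    first, _, last = rec["name"].partition(" ")
    rec["std_name"] = f"{NICKNAMES.get(first, first)} {last}".strip()
    addr = rec["addr"]
    rec["std_addr"] = addr.replace("St", "Street") if addr and addr.endswith("St") else addr
    return rec

def same_party(a, b):
    """Toy match rule: equal SSNs, or equal DOB and one surname prefixes the other."""
    if a["ssn"] and a["ssn"] == b["ssn"]:
        return True
    sa, sb = a["std_name"].split()[-1], b["std_name"].split()[-1]
    return a["dob"] == b["dob"] and (sa.startswith(sb) or sb.startswith(sa))

def survive(cluster):
    """Keep the most complete (non-null, longest) value for each attribute."""
    golden = {}
    for field in ("std_name", "std_addr", "dob", "ssn"):
        values = [r[field] for r in cluster if r[field]]
        golden[field] = max(values, key=len) if values else None
    return golden

records = [standardize(dict(r)) for r in STAGING]
clusters = []
for rec in records:                # greedy clustering, fine for a demo
    for cluster in clusters:
        if same_party(rec, cluster[0]):
            cluster.append(rec)
            break
    else:
        clusters.append([rec])

for mdm_id, cluster in enumerate(clusters, start=1):
    print(mdm_id, survive(cluster))  # -> 1 {'std_name': 'James Stagnitto', ...}
```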
19. Graph Databases (NoSQL) to the Rescue
• Hierarchical relationships are never rigid
• Relational models with tables and columns are not flexible enough
• Neo4j is the leading graph database
• Many MDM systems are going graph:
  • Pitney Bowes – Spectrum MDM
  • Reltio – Worry-Free Data for Life Sciences
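To show why relationships "bring master data to life", here is a small, hypothetical sketch of mastered party data as a property graph, using the networkx library as a stand-in for a real graph database like Neo4j. The entities and relationship types are invented.

```python
import networkx as nx

# Master data as a property graph: parties are nodes, relationships are edges.
g = nx.Graph()
g.add_node("member:1", kind="member", name="James Stagnitto")
g.add_node("provider:77", kind="provider", name="Dr. Ada Smith")
g.add_node("household:9", kind="household")
g.add_edge("member:1", "household:9", rel="BELONGS_TO")
g.add_edge("member:1", "provider:77", rel="TREATED_BY")

# Graph traversal answers relationship questions that are awkward in SQL,
# e.g. "who or what is directly connected to this member, and how?"
for neighbor in g.neighbors("member:1"):
    print(neighbor, g.edges["member:1", neighbor]["rel"])
```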
20. Securing Big Data
Determining who sees what:
• Need to be able to secure as many data types as possible
• Auto-discovery is important!
Current products:
• Sentry – SQL security semantics for Hive
• Knox – central authentication mechanism for Hadoop
• Cloudera Navigator – central security auditing
• Hadoop – good old *NIX permissions with LDAP
• Dataguise – auto-discovery, masking, encryption
• Datameer – the BI tool for Hadoop
Recommendation: assemble based on existing tools (a masking sketch follows).
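In the spirit of the auto-discovery and masking-on-ingest capability mentioned above, here is a minimal, hypothetical sketch: regex-based detection of SSN-like and email-like tokens in free text, masked before the data lands. The patterns and masking policy are illustrative only, not any vendor's actual method.

```python
import re

# Illustrative detectors for two common PII shapes.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def mask_line(line):
    """Replace detected PII with a type tag, keeping the rest of the text."""
    found = []
    for kind, pattern in PII_PATTERNS.items():
        line, n = pattern.subn(f"<{kind.upper()}-MASKED>", line)
        if n:
            found.append((kind, n))
    return line, found

if __name__ == "__main__":
    raw = "Call James (SSN 123-45-6789) back at james@example.com re: claim."
    masked, found = mask_line(raw)
    print(masked)  # Call James (SSN <SSN-MASKED>) back at <EMAIL-MASKED> re: claim.
    print(found)   # [('ssn', 1), ('email', 1)]
```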
21. Metadata
• For now, Hive Metastore + HCatalog + custom tables might be best
• HCatalog gives great "abstraction" services:
  • Maps files to a relational schema
  • Developers don't need to worry about data formats and storage
• Can use SuperLuminate to get started
Recommendation: leverage HCatalog + custom metadata tables (sketched below)
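As one reading of "custom metadata tables", here is a minimal, hypothetical sketch of a registry that records, per dataset, governance attributes the deck calls for (steward, lineage, retention, certification). SQLite is used purely to keep the example self-contained; in practice such tables would live alongside the Hive Metastore, and every column name here is an assumption.

```python
import sqlite3

# A toy custom-metadata table complementing what HCatalog already tracks.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dataset_governance (
        dataset        TEXT PRIMARY KEY,   -- matches the HCatalog table name
        owner          TEXT,
        steward        TEXT,
        lineage        TEXT,               -- upstream source description
        retention_days INTEGER,
        certified      INTEGER DEFAULT 0   -- set by the data steward
    )
""")
conn.execute(
    "INSERT INTO dataset_governance VALUES (?, ?, ?, ?, ?, ?)",
    ("claims_raw", "finance", "j.doe", "SYS A nightly extract", 365, 0),
)

# A governance gate a pipeline could run before promoting data to the BDW.
row = conn.execute(
    "SELECT certified FROM dataset_governance WHERE dataset = ?", ("claims_raw",)
).fetchone()
print("promote to trusted tier" if row and row[0] else "hold: not certified")
```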
22. The Twitter Way
Twitter was suffering from a data science wild west, so they developed their own enterprise Data Access Layer (DAL). They gave developers and data scientists a reason to use it:
• Easy-to-use storage handlers
• Automatic partitioning
• Schema backwards compatibility
• Monitoring and dependency checks
23. Data Quality and Monitoring
To TRUST your information, a robust set of tools for continuous monitoring is needed; the accuracy and completeness of data must be ensured. Any piece of information in the Big Data Warehouse must have monitoring:
• Basic stats: source-to-target counts
• Error events: did we trap any errors during processing?
• Business checks: is the metric "within expectations"? How does it compare with an abridged alternate calculation?
There is a large gap in commercial and open source project offerings here.
24. Data Quality and Monitoring Recommendation
[Diagram: a DQ Engine driven by DQ metadata runs quality checks (assembled by a Quality Check Builder) over Hive, Pig, and MapReduce jobs, feeding a DQ Notifier and Logger plus a store of DQ events and time-series facts.]
• BUILD a robust data quality subsystem:
  • HBase for metadata and error event facts
  • Oozie for orchestration
  • Based on The Data Warehouse ETL Toolkit
(A skeleton of such an engine is sketched below.)
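Here is a minimal, hypothetical skeleton of such a DQ engine in Python: checks are declared as metadata, each run emits an event fact, and the three check types from the previous slide (basic stats, error events, business checks) are represented. In practice the metadata and events would live in HBase and the checks would run as Hive/Pig/MapReduce jobs; all names and thresholds here are invented.

```python
from datetime import datetime, timezone

# Hypothetical DQ metadata: one dict per check, as it might be stored in HBase.
CHECKS = [
    {"name": "row_count_match", "type": "basic_stats",
     "source_count": 10_000, "target_count": 9_998, "tolerance": 0.001},
    {"name": "load_errors", "type": "error_events", "errors_trapped": 0},
    {"name": "claims_total", "type": "business_check",
     "metric": 1_250_000.0, "alternate_calc": 1_262_000.0, "max_drift": 0.02},
]

def run_check(check):
    """Return True when the check passes under its declared thresholds."""
    if check["type"] == "basic_stats":
        drift = abs(check["source_count"] - check["target_count"]) / check["source_count"]
        return drift <= check["tolerance"]
    if check["type"] == "error_events":
        return check["errors_trapped"] == 0
    if check["type"] == "business_check":
        drift = abs(check["metric"] - check["alternate_calc"]) / check["alternate_calc"]
        return drift <= check["max_drift"]
    raise ValueError(f"unknown check type: {check['type']}")

def dq_engine(checks):
    """Run all checks and emit DQ event facts (printed here; HBase in practice)."""
    for check in checks:
        event = {
            "check": check["name"],
            "passed": run_check(check),
            "ts": datetime.now(timezone.utc).isoformat(),
        }
        print(event)  # a DQ Notifier would alert on failures

if __name__ == "__main__":
    dq_engine(CHECKS)
```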
25. Closing Thoughts – Enable the Future
• Today's business environment requires the convergence of data quality, data management, data engineering, and business policies.
• Make sure your data can be trusted and people can be held accountable for impact caused by low data quality.
• Get experts to help calm the turbulence... it can be exhausting!
• Blaze new trails!
Polyglot Persistence – "where any decent sized enterprise will have a variety of different data storage technologies for different kinds of data. There will still be large amounts of it managed in relational stores, but increasingly we'll be first asking how we want to manipulate the data and only then figuring out what technology is the best bet for it." – Martin Fowler
In the data warehouse era, we focused our attention on building a single version of the truth. We mainly applied data governance to the EDW itself and a few primary supporting systems, like MDM. We had a fairly restrictive set of tools for using the EDW data (enterprise BI tools), so it was easier to GOVERN how the data would be used.
Apache Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.