This presentation opens with examples of the power of data and the benefits of data-driven architectures, and explains why a Data Governance program is essential to their success. We then discuss the challenges of implementing a Data Governance framework on a Big Data data lake with open source software, including DataPlane, Apache Atlas and Apache Ranger. Finally, we discuss the importance of democratizing data and of moving to a speed-of-thought framework with Hive LLAP.
2. About me
BI, Data Warehousing and Big Data Evangelist since 1983.
Before joining Ultimate, I was Chief Big Data Architect at Visa, and before that I was
VP of Data Architecture at Fidelity Investments.
My first job was with Bob Earle, “The father of OLAP”.
I worked in the Finance group at Coors Brewing Company where we created some
of the first data warehouses.
I have given many presentations at IOUW, RMOUG, TDWI, Collaborate, Gartner
Group, Oracle Open World and the BI Summit. Also, I have given Metadata and
Data Governance presentations for HIMSS.
I have a degree in statistics, an MBA in Finance and a Master's in Computer Science.
I have authored Oracle Essbase & Oracle OLAP: A Guide to Oracle’s
Multidimensional Solutions, published by Oracle Press, and Oracle Data
Warehousing, published by SAMS.
3. Agenda
• Challenge
• Analytics – What is it?
• The Power of Data
• Data Governance
• Solutions
• The Data Lake – Cloudera – HDP 3.1
• LLAP and Vectorization
• DataPlane – Ranger and Atlas – DSS and DLM
"Good judgment comes from
experience, and a lot of that comes
from bad judgment." - Will Rogers
4. You keep using that word. I
do not think it means what
you think it means.
What do you mean by “analytics”?
Challenge – Analytics and Data Governance
5. There are two parts to “analytics”
• The mathy stuff
• The query & reporting stuff
8. Tesla and LinkedIn Think Resumes Are Overrated.
They Use Neuroscience-Based Games Instead
www.inc.com/kevin-j-ryan/pymetrics-replacing-resumes-with-brain-games.html
That's the philosophy touted by Frida Polli, co-founder and CEO of hiring startup
Pymetrics. The company makes games meant to determine whether a candidate
would be a good fit in a specific role at your company. Polli says that so far, the
platform has been more effective at finding the right hires than traditional resumes.
The results have been promising. Polli says that some companies have more than
doubled the percentage of candidates they hire out of those they invite for in-person
interviews. One-year retention rates have increased by between 30 and 60 percent.
And companies are reporting that job performance has improved among newly hired
candidates.
11. What is a Data Lake?
A single place to store every type of data in its native format with no fixed limits on account size or file
size, high throughput to increase analytic performance and native integration with the Hadoop
ecosystem.
An architectural shift in the BI World that uses Hadoop to deliver deep insight across a large,
broad, diverse set of data at efficient scale.
15. Find Any Business Data in Sub-second
• Each CPU scans local in-memory columns
• Scans use super fast SIMD vector instructions
• Billions of rows/sec scan rate per CPU core
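The columnar part of this slide can be illustrated loosely in plain Python. This is only a sketch of the idea that a filter reads one contiguous column and then fetches other columns just for the matching rows; it does not use SIMD and is not how LLAP is implemented. The data and the names `columns`, `scan_greater_than` and `gather` are hypothetical.

```python
from array import array

# Hypothetical toy data: five rows stored column by column rather than row by row.
columns = {
    "order_id": array("q", [1, 2, 3, 4, 5]),
    "amount_cents": array("q", [1250, 99000, 430, 78000, 15]),
}


def scan_greater_than(column, threshold):
    """Return the row positions whose value in one column exceeds threshold."""
    return [i for i, value in enumerate(column) if value > threshold]


def gather(column, positions):
    """Fetch values from another column only for the matching row positions."""
    return [column[i] for i in positions]


# The filter touches only amount_cents; order_id is read afterwards for the hits.
matches = scan_greater_than(columns["amount_cents"], 50_000)
matching_order_ids = gather(columns["order_id"], matches)
print(matching_order_ids)
```

A vectorized engine would apply the comparison to many values per instruction instead of looping one value at a time, but the access pattern (one column in, positions out) is the same.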
16. GDPR – What is it?
May 25, 2018
Potential Penalty Per Infraction: 4% or €20M
Global Impact
5 Key General Data Protection Regulation Obligations
• Rights of EU Data Subjects
• Security of Personal Data
• Consent
• Accountability of Compliance
• Data Protection by Design and by Default
www.eugdpr.org
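The "4% or €20M" figure can be turned into a small worked calculation. The slide gives only the two numbers; reading them as "4% of worldwide annual turnover or €20M, whichever is higher" is my understanding of the regulation's upper fine tier, and it is an assumption here. The function name is hypothetical, and this is arithmetic, not legal advice.

```python
def max_gdpr_fine_eur(annual_turnover_eur: int) -> int:
    """Cap on the fine, assuming the greater of 4% of turnover or EUR 20M applies."""
    four_percent = annual_turnover_eur * 4 // 100  # integer euros, no float rounding
    return max(four_percent, 20_000_000)


for turnover in (100_000_000, 500_000_000, 2_000_000_000):
    print(f"turnover EUR {turnover:,} -> cap EUR {max_gdpr_fine_eur(turnover):,}")
```

Under that reading the €20M floor dominates until turnover passes €500M, after which the 4% term takes over.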
17. Pillars of our comprehensive Data Governance Solution
• Access: Defining what users and applications can do with data. Technical concepts: Data Policies, Authorization.
• Data Protection: Protecting data in the cluster from unauthorized visibility. Technical concepts: Encryption, tokenization, data masking.
• Visibility: Reporting on where data came from and how it’s being used. Technical concepts: Auditing, Lineage.
• Identity: Guarding access to the cluster itself. Technical concepts: Authentication, Network isolation.
• Discovery: Finding Data Assets and Definitions. Technical concepts: Business Glossary, Technical Glossary and Search.
Tool label shown on the slide: Knox
18. Pillars of our comprehensive Data Governance Solution
The same five pillars (Access, Data Protection, Visibility, Identity, Discovery), annotated with tool and technique labels. Labels shown on the slide:
• Knox
• Ranger
• DataPlane & Atlas
• Hardware, File and Column Encryption
• DataPlane & Atlas
• Knox/Active LDAP Kerberos
19. ACCESS - Establish and Implement Data Policies
▪ Accomplish: Manage and automate the information lifecycle from ingestion to purge, cradle to
grave, based on the unified metadata catalog
- Role-Based Authorization
- Allow an Analyst to see PII data but not a Developer
- Allow for Masking of Data
- Allow for automated enforcement of Data Retention Policies, such as 7 days in Kafka
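The two policy ideas on this slide, role-based access to PII and time-based retention, can be sketched in a few lines. This is a minimal illustration and not Ranger's or Kafka's API; `PII_READERS`, `RETENTION`, `can_read_pii` and `is_expired` are hypothetical names, and the dates are made up.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy table: which roles may read PII columns in the clear.
PII_READERS = {"analyst"}

# Retention windows per system; 7 days for Kafka is the example from the slide.
RETENTION = {"kafka": timedelta(days=7)}


def can_read_pii(role: str) -> bool:
    """Role-based check: only roles listed in PII_READERS see PII."""
    return role.lower() in PII_READERS


def is_expired(system: str, created_at: datetime, now: datetime) -> bool:
    """True when a record is older than the retention window for its system."""
    return now - created_at > RETENTION[system]


# The same window expressed in milliseconds, the unit used by Kafka's
# retention.ms topic setting.
kafka_retention_ms = int(RETENTION["kafka"].total_seconds() * 1000)

now = datetime(2019, 6, 8, tzinfo=timezone.utc)
print(can_read_pii("Analyst"), can_read_pii("Developer"))
print(is_expired("kafka", datetime(2019, 6, 1, 12, tzinfo=timezone.utc), now))
print(is_expired("kafka", datetime(2019, 5, 30, tzinfo=timezone.utc), now))
print(kafka_retention_ms)
```

In a real deployment these rules would live in the policy engine and the broker configuration rather than in application code; the sketch only shows the decisions being made.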
20. Dynamic Row Filtering & Column Masking: Apache Ranger with Apache Hive
Source table ww_customers:
Country | National ID | CC No | DOB | MRN | Name | Policy ID
US | 232323233 | 4539067047629850 | 9/12/1969 | 8233054331 | John Doe | nj23j424
US | 333287465 | 5391304868205600 | 8/13/1979 | 3736885376 | Jane Doe | cadsd984
Germany | T22000129 | 4532786256545550 | 3/5/1963 | 876452830A | Ernie Schwarz | KK-2345909
Ranger Policy Enforcement: Query Rewritten based on Dynamic Ranger Policies: Filter rows by region & apply relevant column masking
User 1: Joe
Location: US
Group: Analyst
Original Query:
SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers
Result:
Country | National ID | CC No | MRN | Name
US | xxxxx3233 | 4539 xxxx xxxx xxxx | null | John Doe
US | xxxxx7465 | 5391 xxxx xxxx xxxx | null | Jane Doe
Users from US Analyst group see data for US persons with CC and National ID (SSN) as masked values and MRN is nullified
User 2: Ivanna
Location: EU
Group: HR
Original Query:
SELECT country, nationalid, name, mrn FROM ww_customers
Result:
Country | National ID | Name | MRN
Germany | T22000129 | Ernie Schwarz | 876452830A
EU HR Policy Admins can see unmasked but are restricted by row filtering policies to see data for EU persons only
Groups shown: Analysts, HR, Marketing
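The enforcement on this slide can be sketched as a row filter plus per-column masks that are applied according to the caller's group. This is a plain-Python illustration of the behaviour, not Ranger's implementation or API: the `POLICIES` structure, the helper names and the `EU_COUNTRIES` set are assumptions, and the rows are the sample rows from the slide.

```python
# In-memory stand-in for the ww_customers sample rows shown on the slide.
ROWS = [
    {"country": "US", "nationalid": "232323233", "ccnumber": "4539067047629850",
     "mrn": "8233054331", "name": "John Doe"},
    {"country": "US", "nationalid": "333287465", "ccnumber": "5391304868205600",
     "mrn": "3736885376", "name": "Jane Doe"},
    {"country": "Germany", "nationalid": "T22000129", "ccnumber": "4532786256545550",
     "mrn": "876452830A", "name": "Ernie Schwarz"},
]

EU_COUNTRIES = {"Germany"}  # assumption: just enough to cover the sample data


def show_last4(value):
    """Mask everything except the last four characters."""
    return "x" * (len(value) - 4) + value[-4:]


def show_first4_card(value):
    """Keep the first four digits of a card number and mask the rest."""
    return value[:4] + " xxxx xxxx xxxx"


def nullify(_value):
    return None


def identity(value):
    return value


# Hypothetical policy shapes: one row filter and a set of column masks per group.
POLICIES = {
    "Analyst": {
        "row_filter": lambda row: row["country"] == "US",
        "masks": {"nationalid": show_last4, "ccnumber": show_first4_card, "mrn": nullify},
    },
    "HR": {
        "row_filter": lambda row: row["country"] in EU_COUNTRIES,
        "masks": {},
    },
}


def run_query(group, selected_columns):
    """Apply the group's row filter, then mask each selected column."""
    policy = POLICIES[group]
    return [
        {col: policy["masks"].get(col, identity)(row[col]) for col in selected_columns}
        for row in ROWS
        if policy["row_filter"](row)
    ]


for row in run_query("Analyst", ["country", "nationalid", "ccnumber", "mrn", "name"]):
    print(row)
for row in run_query("HR", ["country", "nationalid", "name", "mrn"]):
    print(row)
```

Both callers issue an ordinary SELECT against the same table; what differs is the policy looked up for their group, which is why the user never has to know a rewrite happened.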
21. Visibility - Apache Ranger Audits - Data Access
⬢ Comprehensive scalable audit logging
⬢ Audits for:
⬢ Resource Access Events with user context
⬢ Policy Edits/Creation/Deletion
⬢ User session information
⬢ Component plugin policy sync operations
22. Tag (Classification) Based Masking
Masking Policy
For any Hive columns tagged as containing PII:
• Allow HR to see data in the clear for any type of
PII
• Apply ‘Nullify’ mask to columns classified as
type ‘MRN’ for Analysts
• Apply ‘Hash’ as masking option to columns
classified as type ‘Password’
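The masking policy above keys off a column's tag rather than its name, which can be sketched as a lookup from column to tag followed by a rule per tag. This is an illustration only and not how Ranger or Atlas expose tags: `COLUMN_TAGS` and `mask_value` are hypothetical, SHA-256 is an assumed choice for the 'Hash' option, and checking the HR rule first (so HR always sees clear text) is my reading of the slide.

```python
import hashlib

# Hypothetical tag assignments; in practice these would come from the metadata catalog.
COLUMN_TAGS = {"mrn": "MRN", "password": "Password", "nationalid": "PII"}


def mask_value(column, value, group):
    """Return the value as the given group should see it, based on the column's tag."""
    tag = COLUMN_TAGS.get(column)
    if tag is None or group == "HR":
        # Untagged columns are not masked; HR sees any type of PII in the clear.
        return value
    if tag == "Password":
        # 'Hash' option, assumed here to be a SHA-256 hex digest.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    if tag == "MRN" and group == "Analyst":
        # 'Nullify' option for Analysts.
        return None
    return value


print(mask_value("mrn", "8233054331", "Analyst"))
print(mask_value("mrn", "8233054331", "HR"))
print(mask_value("password", "hunter2", "Analyst"))
```

Because the rule is attached to the tag, a newly ingested table picks up the same masking as soon as its columns are classified, with no new per-table policy.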
23. CONSUMABILITY: Understand shape of Hive
column data with statistical profiler, example:
Profile shows box plot and histogram for distribution
of column values
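The numbers behind a box plot and a histogram can be computed with the standard library. This is a sketch of what a column profiler might calculate, not the DSS profiler itself; the function names and the sample `ages` column are hypothetical.

```python
import statistics
from collections import Counter


def five_number_summary(values):
    """Min, quartiles and max: the five values a box plot is drawn from."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median, "q3": q3, "max": max(values)}


def histogram(values, bin_width):
    """Count values per fixed-width bin, keyed by each bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Hypothetical sample of one numeric Hive column.
ages = [22, 25, 29, 31, 34, 38, 41, 45, 52]

print(five_number_summary(ages))
print(histogram(ages, 10))
```

On a real table the profiler would run this kind of summary over a sample or as an aggregate query, since pulling a full column into memory does not scale.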
25. Data Steward Studio (DSS)
DataPlane DSS - Understanding
26. DataPlane DSS – Where is my PII?
27. DataPlane DSS – What Tables are Accessed?
28. DataPlane DSS – PII Trends
29. DataPlane DSS – Data Lineage
30. DataPlane DLM as Backup and DR