Deploying a Governed Data Lake
2
Everyone needs data to make better decisions
3
A data lake
http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml
“Size and low cost”
“Fidelity: Hadoop data lakes preserve data in its original form”
“Ease of accessibility: Accessibility is easy in the data lake”
“Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models”
“Nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”
4
Data warehouse vs. data lake
Data Warehouse
• Production system
• Well-defined usage
• Well-defined schema
• Clean, trusted data
• Heavy IT reliance
– Less technical analysts
– Large IT teams: DBAs, Data Architects, ETL Developers, BI Developers, DQ Developers, Data Modelers, Data Stewards
Data Lake
• Non-production system
• Future, experimental usage
• No schema (schema on read)
• Raw data, frictionless ingestion
• Self-service
– More technical analysts
– IT manages the cluster and ingestion, but no IT involvement when working with data
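To make "schema on read" concrete, here is a minimal sketch. The file contents, the field names and the `read_with_schema` helper are all hypothetical and only illustrate the idea that raw data is stored untouched and a schema is applied when someone reads it.

```python
import csv
import io

# Raw data lands in the lake untouched; no schema is enforced at write time.
# (Hypothetical sample content.)
RAW_FILE = "1001,2015-03-02,19.99\n1002,2015-03-03,5.00\n"

# Hypothetical schema chosen by the analyst at read time: (field name, parser).
ORDER_SCHEMA = [("order_id", int), ("order_date", str), ("amount", float)]


def read_with_schema(raw_text, schema):
    """Apply a schema to raw delimited text while reading it."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append(
            {name: parse(value) for (name, parse), value in zip(schema, record)}
        )
    return rows


orders = read_with_schema(RAW_FILE, ORDER_SCHEMA)
print(orders[0])
```

A different analyst could read the same raw file with a different schema, which is the flexibility the warehouse's fixed schema does not offer.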
5
Hadoop as the platform for a scalable data lake infrastructure
• Lots of data (Volume): cost-effective storage and scalable processing (✔ Hadoop)
• Flexibility to handle all kinds of data (Variety) (✔ Hadoop)
• Will be around for a long time: modularity to ensure future-proofing (✔ Hadoop)
6
Is Hadoop enough?
Big Data Architect
Hadoop
We have Hadoop, now what?
10-20 nodes
7
Big Data Architect
Hadoop
How do I get the business to start using it?
Data Scientist/Business Analyst
10-20 nodes
8
Big Data Architect
Hadoop
How do I get the business to start using it?
Data Scientist/Business Analyst
How do I find and understand data easily to do big data analytics?
Self-service
10-20 nodes
9
Big Data Architect
Hadoop
Data Scientist/Business Analyst
No security and governance
10-20 nodes
Risk/Data Governance Executive
How do I ensure compliance with regulations and data policies? Sensitive data?
10
Big Data Architect
Hadoop
How do I scale?
Data Scientist/Business Analysts
100s/1000s of nodes
Manual process to catalog the lake can’t scale
11
The platform for a scalable data lake infrastructure
• Lots of data (Volume): cost-effective storage and scalable processing (✔ Hadoop)
• Flexibility to handle all kinds of data (Variety) (✔ Hadoop)
• Will be around for a long time: modularity to ensure future-proofing (✔ Hadoop)
• Self-service to help users find, understand and use the data (X Hadoop)
• Governance to protect sensitive data, document lineage and assess quality (X Hadoop)
12
Waterline Data Inventory broadens Hadoop
adoption through governed self-service
Big Data Architect
Hadoop
Data Scientist/Business Analyst
100s/1000s of nodes
Risk/Data Governance Executive
Self-service
Security and governance
Massive scale
13
3-phase approach to a governed data lake
Organize the lake
Inventory the lake
Open up the lake
14
Organize the lake into zones
Organize the lake
15
Establish access control per zone
• Business Analysts
• Data Scientists
• Data Scientists
• Data Engineers
• Data Scientists
• Data Engineers
• Data Stewards
Zones: Sensitive, Landing, Gold, Work
Organize the lake
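Per-zone access control can be sketched as a lookup from zone to permitted roles. The slide does not preserve which roles belong to which zone, so the assignments in `ZONE_ROLES`, the path layout and the `can_read` helper below are illustrative assumptions, not the presenter's policy. A real deployment would enforce this with the cluster's own permission mechanisms.

```python
# Hypothetical zone -> roles mapping; the real assignment is a site policy.
ZONE_ROLES = {
    "landing": {"data_engineer"},
    "work": {"data_scientist", "data_engineer"},
    "gold": {"business_analyst", "data_scientist"},
    "sensitive": {"data_steward"},
}


def can_read(role, path):
    """Return True if `role` may read `path`, based on the path's top-level zone.

    Assumes (for illustration) that each zone is a top-level directory,
    e.g. /gold/sales/orders.csv.
    """
    zone = path.strip("/").split("/")[0]
    return role in ZONE_ROLES.get(zone, set())


print(can_read("data_steward", "/sensitive/hr/payroll.csv"))
print(can_read("business_analyst", "/landing/raw/clicks.log"))
```

Unknown zones fall through to an empty role set, so access is denied by default.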
16
The governed data lake
[Diagram: roles (Data Scientist/Business Analyst, Data Steward, Big Data Architect); actions (Find/understand, Govern, Inventory); Waterline Data Inventory as the governed data layer (Self-Service, Governance, Inventory) over HDFS and Hive]
17
The governed data lake
[Diagram: roles (Data Scientist/Business Analyst, Data Steward, Big Data Architect); actions (Find/understand, Govern, Inventory); Waterline Data Inventory as the governed data layer (Self-Service Catalog/Provisioning, Metadata Curation, Inventory) over HDFS and Hive]
Inventory the lake
Profile and discover the content of files and Hive tables
18
Inventory
Parse multiple content types
Create catalog automatically
Discover lineage automatically
19
The governed data lake
[Diagram: roles (Data Scientist/Business Analyst, Data Steward, Big Data Architect); actions (Find/understand, Govern, Inventory); Waterline Data Inventory as the governed data layer (Self-Service Catalog/Provisioning, Governance, Inventory) over HDFS and Hive]
Govern the lake
Governance
• Inspect files and perform tag curation
• Identify sensitive data
• Assess data quality
• Discover data lineage
• Manage glossary
20
Navigate Lineage of Files in Hadoop
Clickable, navigable lineage discovered using file content or imported from other tools through REST APIs
21
Automated Data Profiling Helps with Quality Assessment
Infographic shows contents at a glance:
• Different types of data in the same field
• Number of missing values
Separate profiles for each data type including number of unique values (cardinality), uniqueness (selectivity) and type-specific measures like mean and standard deviation for numbers
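The profile measures named above can be sketched in a few lines. This is an illustration only, not Waterline Data Inventory's implementation: the `profile_field` helper, the two-way number/string split and the use of population standard deviation are assumptions made for the example.

```python
import statistics


def profile_field(values):
    """Profile one field: missing count, then one sub-profile per data type."""
    missing = sum(1 for v in values if v is None or v == "")
    by_type = {"number": [], "string": []}
    for v in values:
        if v is None or v == "":
            continue
        try:
            by_type["number"].append(float(v))
        except ValueError:
            by_type["string"].append(v)

    profile = {"missing": missing, "types": {}}
    for type_name, vals in by_type.items():
        if not vals:
            continue
        cardinality = len(set(vals))  # number of unique values
        stats = {
            "count": len(vals),
            "cardinality": cardinality,
            "selectivity": cardinality / len(vals),  # uniqueness ratio
        }
        if type_name == "number":
            stats["mean"] = statistics.mean(vals)
            stats["stdev"] = statistics.pstdev(vals)  # population std. dev.
        profile["types"][type_name] = stats
    return profile


# A field that mixes numbers, a stray string and missing values.
p = profile_field(["10", "20", "20", "", "n/a", None])
print(p)
```

More than one key under `types` signals different types of data in the same field, which is often the first quality problem worth investigating.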
22
Data Preview and Visualization Helps Understand the Data
Visualization helps understand the shape and distribution of data
Most frequent values for each field
23
Discover Sensitive Data
Find all fields that may contain Social Security numbers (SSNs)
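One simple way to flag candidate SSN fields is to test how many values in each field match an SSN-like pattern. The sketch below assumes formatted values (NNN-NN-NNNN) and a hypothetical 80% match threshold; it illustrates content-based discovery in general and is not a description of how the product detects sensitive data.

```python
import re

# Simplified pattern for formatted US SSNs (NNN-NN-NNNN); an assumption
# for illustration. Unformatted nine-digit values would need extra rules.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def fields_that_may_have_ssn(table, threshold=0.8):
    """Return names of fields where at least `threshold` of the
    non-empty values look like SSNs."""
    flagged = []
    for name, values in table.items():
        non_empty = [v for v in values if v]
        if not non_empty:
            continue
        hits = sum(1 for v in non_empty if SSN_PATTERN.match(v))
        if hits / len(non_empty) >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical sample; field names are positional because delimited
# files often carry no header.
sample = {
    "col_0": ["123-45-6789", "987-65-4321", ""],
    "col_1": ["Alice", "Bob", "Carol"],
}
flagged = fields_that_may_have_ssn(sample)
print(flagged)
```

Matching on content rather than field names matters because a sensitive column may be unnamed or misleadingly named.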
24
Curate Discovered Sensitive Data Fields
Curate the field and accept or reject the tag
25
Manage Glossary
Import or create a business glossary
Manage tags
26
View and search history
Data Inventory keeps track of all user tagging, schema changes and lineage changes in Audit History
27
Open up the data lake
[Diagram: roles (Data Scientist/Business Analyst, Data Steward, Big Data Architect); actions (Find/understand, Govern, Inventory); Waterline Data Inventory as the governed data layer (Self-Service, Governance, Inventory) over HDFS and Hive]
Explore catalog and provision data securely
Open up the lake
28
Find and Understand
Automatically propagate user-defined tags (crowdsource ontology)
Discover meaning of fields and tag automatically
Multi-faceted drill down
Automated facet creation based on metadata
Business metadata-based search
29
Annotate fields, files and folders with tags
• Analysts can tag fields and files with meaningful business tags
• Type-ahead shows existing available tags that match the typed string
• Users can choose one or create a new tag
• Period in tag name automatically creates tag hierarchy (e.g., Restaurant.Name creates category “Restaurant” and tag “Name”)
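The period rule for tag names can be sketched as a small parser. The `parse_tag` helper is hypothetical, and treating every segment before the last as a nested category is an assumption; the slide only shows the two-level case.

```python
def parse_tag(tag_name):
    """Split a dotted tag name into (category path, tag).

    "Restaurant.Name" -> (["Restaurant"], "Name")
    A name without a period has an empty category path.
    """
    parts = [p.strip() for p in tag_name.split(".") if p.strip()]
    if not parts:
        raise ValueError("empty tag name")
    return parts[:-1], parts[-1]


categories, tag = parse_tag("Restaurant.Name")
print(categories, tag)
```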
30
Based on a single field in one file tagged as Restaurant.Name, the Waterline Data Inventory discovery engine found 25 additional instances of Restaurant Name automatically.
User-assigned tags are solid blue
Automatically suggested tags are faded blue with confidence level
Delimited files don’t have field names
Waterline Data Inventory learns from analysts who manually tag fields and automatically finds and tags similar fields
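The slides do not say how the discovery engine decides that two fields are similar. As a stand-in, the sketch below uses Jaccard overlap between value sets as a confidence score; this is a swapped-in, simplified technique for illustration, and the helper names, sample values and 0.5 threshold are all assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two value collections (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_tags(tagged_fields, untagged_fields, min_confidence=0.5):
    """Suggest a tag for each untagged field whose values overlap enough
    with a field an analyst has already tagged.

    Returns {field: (suggested tag, confidence)}.
    """
    suggestions = {}
    for field, values in untagged_fields.items():
        best_tag, best_score = None, 0.0
        for tag, tagged_values in tagged_fields.items():
            score = jaccard(values, tagged_values)
            if score > best_score:
                best_tag, best_score = tag, score
        if best_tag is not None and best_score >= min_confidence:
            suggestions[field] = (best_tag, round(best_score, 2))
    return suggestions


# Hypothetical data: one analyst-tagged field, two unnamed fields.
tagged = {"Restaurant.Name": ["Cafe A", "Cafe B", "Cafe C"]}
untagged = {
    "file2.col_3": ["Cafe B", "Cafe C", "Cafe A", "Cafe D"],
    "file2.col_4": ["94110", "94102"],
}
suggestions = suggest_tags(tagged, untagged)
print(suggestions)
```

The score plays the role of the confidence level shown next to suggested tags; a steward would still accept or reject each suggestion.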
31
Create Hive tables
[Screenshot: file with the “Generate Hive Table” option selected]
Generate Hive Tables
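Generating a Hive table for a cataloged file amounts to emitting a `CREATE EXTERNAL TABLE` statement over the file's directory. The sketch below builds such a statement for a delimited text file; the `hive_ddl` helper, the column names (imagined as coming from curated tags) and the HDFS path are hypothetical, and the product's generated DDL may differ.

```python
def hive_ddl(table, columns, location, delimiter=","):
    """Build a CREATE EXTERNAL TABLE statement for a delimited text file.

    `columns` is a list of (name, hive_type) pairs; `location` is the
    HDFS directory that holds the file(s).
    """
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{table}` (\n  {cols}\n)\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table definition.
ddl = hive_ddl(
    "restaurants",
    [("name", "STRING"), ("zip", "STRING"), ("rating", "DOUBLE")],
    "/gold/restaurants",
)
print(ddl)
```

An external table leaves the underlying files in place, so dropping the table later does not delete data from the lake.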
32
33
Company overview
• Headquartered in Mountain View, CA
• Funded in 2013 by Menlo Ventures and Sigma West
• Management Team:
– Alex Gorelik, Founder, CEO: Founded Exeros (IBM) and Acta (SAP), IBM DE, Informatica GM. Columbia BSCS, Stanford MSCS.
– Oliver Claude, Marketing: VP SAP, VP Informatica, IBM, Siebel. Nova Southeastern MS MIS.
– Jason Chen, Engineering: VP Teradata, Acta, Sybase. USC PhD CS.
– Ravi Ramachandran, Sales: CSC-Infochimps Big Data, AppLabs, Xchanging, Pegasystems. Scient (Razorfish)
WATERLINE DATA NAMED COOL VENDOR
Gartner, Cool Vendors in Information Governance and MDM, 2015
Guido DeSimoni, Roxane Edjlali, Saul Judah, Bill O'Kane, Andrew White
Visit our exhibit in the ballroom to get more information

More Related Content

What's hot

Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDataWorks Summit
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...DataWorks Summit/Hadoop Summit
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data LakeVMware Tanzu
 
Traditional data warehouse vs data lake
Traditional data warehouse vs data lakeTraditional data warehouse vs data lake
Traditional data warehouse vs data lakeBHASKAR CHAUDHURY
 
Developing a Strategy for Data Lake Governance
Developing a Strategy for Data Lake GovernanceDeveloping a Strategy for Data Lake Governance
Developing a Strategy for Data Lake GovernanceTony Baer
 
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...Seeling Cheung
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureCaserta
 
Creating a Modern Data Architecture
Creating a Modern Data ArchitectureCreating a Modern Data Architecture
Creating a Modern Data ArchitectureZaloni
 
Navigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryNavigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryDataWorks Summit/Hadoop Summit
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success DataWorks Summit/Hadoop Summit
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...NoSQLmatters
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecturemark madsen
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architectureMilos Milovanovic
 
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...DataWorks Summit/Hadoop Summit
 
Data Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureData Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureZaloni
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesDenodo
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake ArchitectureDATAVERSITY
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeDataWorks Summit
 

What's hot (20)

Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
 
Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the...
 
10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake10 Amazing Things To Do With a Hadoop-Based Data Lake
10 Amazing Things To Do With a Hadoop-Based Data Lake
 
Traditional data warehouse vs data lake
Traditional data warehouse vs data lakeTraditional data warehouse vs data lake
Traditional data warehouse vs data lake
 
Developing a Strategy for Data Lake Governance
Developing a Strategy for Data Lake GovernanceDeveloping a Strategy for Data Lake Governance
Developing a Strategy for Data Lake Governance
 
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha...
 
Incorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic ArchitectureIncorporating the Data Lake into Your Analytic Architecture
Incorporating the Data Lake into Your Analytic Architecture
 
Creating a Modern Data Architecture
Creating a Modern Data ArchitectureCreating a Modern Data Architecture
Creating a Modern Data Architecture
 
Navigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data DiscoveryNavigating the World of User Data Management and Data Discovery
Navigating the World of User Data Management and Data Discovery
 
Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success Swimming Across the Data Lake, Lessons learned and keys to success
Swimming Across the Data Lake, Lessons learned and keys to success
 
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
Alexandre Vasseur - Evolution of Data Architectures: From Hadoop to Data Lake...
 
Building the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architectureBuilding the Enterprise Data Lake: A look at architecture
Building the Enterprise Data Lake: A look at architecture
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
 
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
Using Machine Learning to Capture Data Meaning and Wrangle it to Liberate its...
 
Data Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data ArchitectureData Lakes - The Key to a Scalable Data Architecture
Data Lakes - The Key to a Scalable Data Architecture
 
Big Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data LakesBig Data: Architecture and Performance Considerations in Logical Data Lakes
Big Data: Architecture and Performance Considerations in Logical Data Lakes
 
Data Preparation of Data Science
Data Preparation of Data ScienceData Preparation of Data Science
Data Preparation of Data Science
 
Data Lake Architecture
Data Lake ArchitectureData Lake Architecture
Data Lake Architecture
 
Big Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short TimeBig Data at Geisinger Health System: Big Wins in a Short Time
Big Data at Geisinger Health System: Big Wins in a Short Time
 

Viewers also liked

Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceHortonworks
 
Cortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data CatalogCortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data CatalogMSAdvAnalytics
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWSAmazon Web Services
 
Augmenting Big Data Analytics with Nirvana
Augmenting Big Data Analytics with NirvanaAugmenting Big Data Analytics with Nirvana
Augmenting Big Data Analytics with NirvanaIgor Sfiligoi
 
Introduction to security in the Open Science Grid - OSG School 2014
Introduction to security in the Open Science Grid - OSG School 2014Introduction to security in the Open Science Grid - OSG School 2014
Introduction to security in the Open Science Grid - OSG School 2014Igor Sfiligoi
 
Quatre experiments de Física amb làser per a 2n de Batxillerat
Quatre experiments de Física amb làser per a 2n de BatxilleratQuatre experiments de Física amb làser per a 2n de Batxillerat
Quatre experiments de Física amb làser per a 2n de BatxilleratMarcel Jorba
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionMurtaza Doctor
 
HSM migration with EasyHSM and Nirvana
HSM migration with EasyHSM and NirvanaHSM migration with EasyHSM and Nirvana
HSM migration with EasyHSM and NirvanaIgor Sfiligoi
 
SOA Pattern Event Driven Messaging
SOA Pattern Event Driven MessagingSOA Pattern Event Driven Messaging
SOA Pattern Event Driven MessagingWSO2
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopCaserta
 
Defining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentDefining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentCaserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and InnovationCaserta
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure CloudCaserta
 
Data Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalData Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalCaserta
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastoreKishore Gopalakrishna
 
Data Driven Decisions - Big Data Warehousing Meetup, FICO
Data Driven Decisions - Big Data Warehousing Meetup, FICOData Driven Decisions - Big Data Warehousing Meetup, FICO
Data Driven Decisions - Big Data Warehousing Meetup, FICOCaserta
 

Viewers also liked (20)

Implementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data GovernanceImplementing a Data Lake with Enterprise Grade Data Governance
Implementing a Data Lake with Enterprise Grade Data Governance
 
Cortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data CatalogCortana Analytics Workshop: Azure Data Catalog
Cortana Analytics Workshop: Azure Data Catalog
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Augmenting Big Data Analytics with Nirvana
Augmenting Big Data Analytics with NirvanaAugmenting Big Data Analytics with Nirvana
Augmenting Big Data Analytics with Nirvana
 
Introduction to security in the Open Science Grid - OSG School 2014
Introduction to security in the Open Science Grid - OSG School 2014Introduction to security in the Open Science Grid - OSG School 2014
Introduction to security in the Open Science Grid - OSG School 2014
 
Big Data Building Blocks with AWS Cloud
Big Data Building Blocks with AWS CloudBig Data Building Blocks with AWS Cloud
Big Data Building Blocks with AWS Cloud
 
Quatre experiments de Física amb làser per a 2n de Batxillerat
Quatre experiments de Física amb làser per a 2n de BatxilleratQuatre experiments de Física amb làser per a 2n de Batxillerat
Quatre experiments de Física amb làser per a 2n de Batxillerat
 
Big Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to ActionBig Data at Tube: Events to Insights to Action
Big Data at Tube: Events to Insights to Action
 
Big Data Applications
Big Data ApplicationsBig Data Applications
Big Data Applications
 
HSM migration with EasyHSM and Nirvana
HSM migration with EasyHSM and NirvanaHSM migration with EasyHSM and Nirvana
HSM migration with EasyHSM and Nirvana
 
EasyHSM Overview
EasyHSM OverviewEasyHSM Overview
EasyHSM Overview
 
SOA Pattern Event Driven Messaging
SOA Pattern Event Driven MessagingSOA Pattern Event Driven Messaging
SOA Pattern Event Driven Messaging
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with HadoopBig Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop
 
Defining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentDefining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business Environment
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Data Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalData Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobal
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastore
 
Data Driven Decisions - Big Data Warehousing Meetup, FICO
Data Driven Decisions - Big Data Warehousing Meetup, FICOData Driven Decisions - Big Data Warehousing Meetup, FICO
Data Driven Decisions - Big Data Warehousing Meetup, FICO
 

Similar to Deploy a Governed Data Lake with Waterline Data Inventory

ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesDATAVERSITY
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?James Serra
 
Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentationmlang222
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos MilovanovicInstitute of Contemporary Sciences
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amirydatastack
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)How to Quickly and Easily Draw Value  from Big Data Sources_Q3 symposia(Moa)
How to Quickly and Easily Draw Value from Big Data Sources_Q3 symposia(Moa)Moacyr Passador
 
The Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedThe Data Lake and Getting Buisnesses the Big Data Insights They Need
The Data Lake and Getting Buisnesses the Big Data Insights They NeedDunn Solutions Group
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeSaurabh K. Gupta
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...DataWorks Summit
 
Enterprise Data Lake
Enterprise Data LakeEnterprise Data Lake
Enterprise Data Lakesambiswal
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digitalsambiswal
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL David Smelker
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
 
American family hadoop journey, uw ebc sig meeting, april 2015
American family hadoop journey, uw ebc sig meeting, april 2015American family hadoop journey, uw ebc sig meeting, april 2015
American family hadoop journey, uw ebc sig meeting, april 2015Craig Jordan
 

Similar to Deploy a Governed Data Lake with Waterline Data Inventory (20)

ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Is the traditional data warehouse dead?
Is the traditional data warehouse dead?Is the traditional data warehouse dead?
Is the traditional data warehouse dead?
 
Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentation
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 
Data lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiryData lake-itweekend-sharif university-vahid amiry
Data lake-itweekend-sharif university-vahid amiry
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 


Deploy a Governed Data Lake with Waterline Data Inventory

  • 2. 2 Everyone needs data to make better decisions
  • 3. 3 A data lake http://www.pwc.com/us/en/technology-forecast/2014/issue1/features/data-lakes.jhtml “Size and low cost” “Fidelity: Hadoop data lakes preserve data in its original form” “Ease of accessibility: Accessibility is easy in the data lake” “Late binding: Hadoop lends itself to flexible, task-oriented structuring and does not require up-front data models” “Nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.”
  • 4. 4 Data warehouse vs. data lake Data Warehouse • Production system • Well-defined usage • Well-defined schema • Clean, trusted data • Heavy IT reliance – Less technical analysts – Large IT teams: DBAs, Data Architects, ETL Developers, BI Developers, DQ Developers, Data Modelers, Data Stewards Data Lake • Non-production system • Future, experimental usage • No schema (schema on read) • Raw data, frictionless ingestion • Self-service – More technical analysts – IT manages the cluster and ingestion, but no IT involvement when working with data
  • 5. 5 as the platform for a scalable data lake infrastructure ✔ Hadoop ✔ Hadoop ✔ Hadoop • Lots of data (Volume): cost-effective storage and scalable processing • Flexibility to handle all kinds of data (Variety) • Will be around for a long time: modularity to ensure future-proofing
  • 6. 6 Is Hadoop enough? Big Data Architect Hadoop We have Hadoop, now what? 10-20 nodes
  • 7. 7 Big Data Architect Hadoop How do I get the business to start using it? Data Scientist/Business Analyst 10-20 nodes
  • 8. 8 Big Data Architect Hadoop How do I get the business to start using it? Data Scientist/Business Analyst How do I find and understand data easily to do big data analytics? Self-service 10-20 nodes
  • 9. 9 Big Data Architect Hadoop Data Scientist/Business Analyst No security and governance 10-20 nodes Risk/Data Governance Executive How do I ensure compliance with regulations and data policies? Sensitive data?
  • 10. 10 Big Data Architect Hadoop How do I scale? Data Scientist/Business Analysts 100s/1000s of nodes Manual process to catalog the lake can’t scale
  • 11. 11 • Lots of data (Volume): cost-effective storage and scalable processing • Flexibility to handle all kinds of data (Variety) • Will be around for a long time: modularity to ensure future-proofing • Self-service to help users find, understand and use the data • Governance to protect sensitive data, document lineage and assess quality The platform for a scalable data lake infrastructure ✔ Hadoop ✔ Hadoop ✔ Hadoop X Hadoop X Hadoop
  • 12. 12 Waterline Data Inventory broadens Hadoop adoption through governed self-service Big Data Architect Hadoop Data Scientist/Business Analyst 100s/1000s of nodes Risk/Data Governance Executive Self-service Security and governance Massive scale
  • 13. 13 3-phase approach to a governed data lake Organize the lake Inventory the lake Open up the lake
  • 14. 14 Organize the lake into zones Organize the lake
  • 15. 15 Establish access control per zone • Business Analysts • Data Scientists • Data Scientists • Data Engineers • Data Scientists • Data Engineers • Data Stewards Sensitive Landing Gold Work Organize the lake
  • 16. 16 The governed data lake Data Scientist/Business Analyst Data Steward Big Data Architect HDFS Hive Waterline Data Inventory Find/understand Govern Governed data layer Governance Inventory Self-Service
  • 17. 17 Metadata Curation Self-Service Catalog/Provisioning Big Data Architect Find/understand Governed data layer Data Scientist/Business Analyst The governed data lake Data Steward HDFS Hive Waterline Data Inventory Govern Inventory Inventory the lake Profile and discover the content of files and Hive tables
  • 18. 18 Inventory Parse multiple content types Create catalog automatically Discover lineage automatically
  • 19. 19 Self-Service Catalog/Provisioning Big Data Architect Find/understand Governed data layer Data Scientist/Business Analyst The governed data lake Data Steward HDFS Hive Waterline Data Inventory Govern Inventory Govern the lake Governance • Inspect files and perform tag curation • Identify sensitive data • Assess data quality • Discover data lineage • Manage glossary
  • 20. 20 Navigate Lineage of Files in Hadoop Clickable, navigable lineage discovered using file content or imported from other tools through REST APIs
  • 21. 21 Automated Data Profiling Helps with Quality Assessment Infographic shows contents at a glance: • Different types of data in the same field • Number of missing values Separate profiles for each data type including number of unique values (cardinality), uniqueness (selectivity) and type-specific measures like mean and standard deviation for numbers
  • 22. 22 Data Preview and Visualization Helps Understand the Data Visualization helps understand the shape and distribution of data Most frequent values for each field
  • 23. 23 Discover Sensitive Data Screen shot Find all fields that may have SSN
  • 24. 24 Curate Discovered Sensitive Data Fields Curate the field and accept or reject the tag
  • 25. 25 Manage Glossary Import or create a business glossary Manage tags
  • 26. 26 View and search history Screenshot of history tab Another screenshot of searching history (made up) Data Inventory keeps track of all user tagging, schema changes, lineage changes in Audit History
  • 27. 27 Data Steward Govern Big Data Architect Governed data layer Open up the data lake HDFS Hive Waterline Data Inventory Inventory Governance Self-Service Find/understand Data Scientist/Business Analyst Explore catalog and provision data securely Open up the lake
  • 28. 28 Find and Understand Automatically propagate user-defined tags (crowdsource ontology) Discover meaning of fields and tag automatically Multi-faceted drill down Automated facet creation based on metadata Business metadata-based search
  • 29. 29 Annotate fields, files and folders with tags • Analysts can tag fields and files with meaningful business tags • Type-ahead shows existing available tags that match the typed string • Users can choose one or create a new tag • Period in tag name automatically creates tag hierarchy (e.g., Restaurant.Name creates category “Restaurant” and tag “Name”)
  • 30. 30 Based on a single field in one file tagged as Restaurant.Name, the Waterline Data Inventory discovery engine found 25 additional instances of Restaurant Name automatically. User-assigned tags are solid blue. Automatically suggested tags are faded blue with a confidence level. Delimited files don’t have field names. Waterline Data Inventory learns from analysts who manually tag fields and automatically finds and tags similar fields.
  • 31. 31 Create Hive tables Screen shot of file with “Generate Hive Table” option selected - Replace Hive with Drill Generate Hive Tables
  • 32. 32
  • 33. 33 Company overview • Headquartered in Mountain View, CA • Funded in 2013 by Menlo Ventures and Sigma West • Management Team: Alex Gorelik, Founder, CEO Founded Exeros (IBM) and Acta (SAP), IBM DE, Informatica GM. Columbia BSCS, Stanford MSCS. Oliver Claude, Marketing VP SAP, VP Informatica, IBM, Siebel. Nova Southeastern MS MIS. Jason Chen, Engineering VP Teradata, Acta, Sybase. USC PhD CS. Ravi Ramachandran, Sales CSC-Infochimps Big Data, AppLabs, Xchanging, Pegasystems. Scient (Razorfish) WATERLINE DATA NAMED COOL VENDOR Gartner, Cool Vendors in Information Governance and MDM, 2015 Guido DeSimoni, Roxane Edjlali, Saul Judah, Bill O'Kane, Andrew White
  • 34. Visit our exhibit in the ballroom to get more information
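
The per-field measures named on slide 21 (missing values, mixed data types in one field, cardinality, selectivity, mean and standard deviation) can be illustrated with a small sketch. This is not Waterline Data Inventory code: the function name `profile_field`, the definition of selectivity as unique values divided by non-missing values, and the use of the population standard deviation are all assumptions made for illustration.

```python
import math
from collections import Counter


def profile_field(values):
    """Profile one field given as a list of raw strings (None or "" counts as missing).

    Hypothetical helper for illustration only; not part of any product API.
    """
    present = [v for v in values if v is not None and v != ""]
    missing = len(values) - len(present)

    # Split the present values by detected type: numeric vs. plain string.
    numbers, strings = [], []
    for v in present:
        try:
            numbers.append(float(v))
        except ValueError:
            strings.append(v)

    cardinality = len(set(present))  # number of unique values
    # Assumed definition: share of non-missing values that are distinct.
    selectivity = cardinality / len(present) if present else 0.0

    profile = {
        "count": len(values),
        "missing": missing,
        "types": {"number": len(numbers), "string": len(strings)},
        "cardinality": cardinality,
        "selectivity": selectivity,
        "most_frequent": Counter(present).most_common(3),
    }

    # Type-specific measures for the numeric part of the field.
    if numbers:
        mean = sum(numbers) / len(numbers)
        variance = sum((x - mean) ** 2 for x in numbers) / len(numbers)
        profile["mean"] = mean
        profile["std_dev"] = math.sqrt(variance)  # population standard deviation
    return profile


if __name__ == "__main__":
    print(profile_field(["10", "20", "20", "", None, "n/a"]))
```

A field that reports both `number` and `string` counts above zero corresponds to the "different types of data in the same field" case on the slide, and the `most_frequent` entry mirrors the most-frequent-values view mentioned on slide 22.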