Data Lake - Multitenancy Best Practices

This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information
contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
30 November, 2018 | Author: Sanjay Upadhyay; Sr. Solution Architect
CitiusTech Thought Leadership

Objective
 Multitenancy in a data lake allows organizations to share their cluster resources across user
communities without impacting business SLAs and capabilities or security and privacy needs
 This document covers guidelines around achieving multitenancy in a data lake environment
 It presents the design and implementation guidelines for on-premise and cloud-based multitenant
data lakes, and highlights the reference architecture for both deployment options

Agenda
 Introduction
 Key Drivers for Data Lake Multitenancy
 On-Premise Data Lake Multitenancy
• Design Considerations
• Implementation Guidelines
• Reference Architecture
 On-Cloud Data Lake Multitenancy
• Design Considerations
• Implementation Guidelines
• Reference Architecture

Introduction
 In modern data management infrastructure, a data lake is an important repository that holds a vast
amount of data in its native format until it is needed. An Enterprise Data Lake (EDL) not only
stores and shares enterprise-wide information but can also run a variety of enterprise workloads
such as batch processing, stream processing, interactive SQL, enterprise search and advanced
analytics
 The adoption of data lakes is increasing, and organizations are leveraging them for numerous
use cases. As usage increases, operational challenges such as resource contention also increase,
which can cause critical business processes to fail and service level agreements (SLAs) to be missed
 Earlier, most data lake implementations were on-premise. However, organizations today are
leveraging cloud technology to replace or expand their existing data lake implementations
 Data lake multitenancy is an important architectural paradigm that enables multiple business
users and processes to share a common set of resources, such as Apache Hadoop clusters
 This includes setting up appropriate policies around resource provisioning and access while
meeting SLAs and security requirements for each tenant
 The guidelines for achieving multitenancy differ between on-premise and on-cloud deployments.
This document describes the key design principles and implementation guidelines for each

Key Drivers for Data Lake Multitenancy
 Resource and Cost Optimization: Having multiple business units share the same cluster resources
brings significant cost savings (across hardware and operational costs)
 Collaboration and Decision Making: Multitenancy helps teams share data and collaborate, and
integrates multiple data silos into a unified view of data
 Operational Simplicity: Sharing a cluster among multiple users makes it much easier to
deploy and manage
 Wide Audience: A multitenant, cloud-based cluster enables a wide range of stakeholders
(developers, analysts, data scientists) from different organizational units to access and use the
data
 Seamless Scalability: A multitenant model makes it easy for IT teams to scale out clusters to
meet changing business needs (e.g., business growth, new markets, reorganizations, mergers and
acquisitions)

On-Premise Design Considerations
Key design considerations to achieve data lake multitenancy:
 Defining the Tenant: Identify the different business units that will access the cluster. Every
business unit should have defined use cases for leveraging the enterprise data lake
 Selecting the Right Isolation Model: Strategic cluster design can follow one of three key
architectural models: Share Nothing, Shared Management and Shared Resources
 Defining the Resource Usage Agreement: Ensure complete coverage of utilization and SLA
requirements across storage, compute, data governance, tracking, auditing and on-boarding

On-Premise Design: Isolation Models
Model 1: Share Nothing
 Cluster management and data are segregated
 Does not leverage multitenancy, but can be used by IT teams based on operational realities
and governance policies
Model 2: Shared Management
 Cluster management is shared, while data and resources are separated for tenant groups
 Useful for high-priority clusters that cannot afford to risk any performance issues or
resource contention
Model 3: Shared Resources
 Leverages the multitenancy benefits of consolidated cluster management, shared data and
shared resources
 This is the recommended isolation model for developing enterprise data lakes
[Figure: isolation model diagrams. Share Nothing shows a separate Cluster Manager for Tenant A and
for Tenant B; the other two models show a single Cluster Manager over Tenant A and Tenant B.]

On-Premise Design: Resource Usage Agreement
Key Components
Storage
 Every tenant group on the cluster should have access to its own section (namespace / directory)
 A dedicated directory should be assigned to the users in a tenant group to store data
 The storage quota needs to be controlled so that a large influx of data from one group does
not impact the work of other groups and users
Compute
 All tenant groups should be guaranteed an agreed-upon minimum of compute power at all times
 Compute power should be provided at the tenant group level, irrespective of the individual
tenant users
Data Governance
 Define metadata management and data lineage
Tracking & Auditing
 Track cluster access and generate reports
 Audit the data assets accessed, along with metadata such as access time and IP address
On-boarding
 Design the hierarchy of the tenant groups and service accounts
 The process of adding tenant groups and users to the cluster should be straightforward

On-Premise Implementation Guidelines
Hadoop Distributed File System (HDFS) Resource Management
HDFS is the key storage unit in a data lake and is shared by all the tenant groups, users and
service accounts for processing jobs. HDFS storage is broadly classified into three categories:
 LOB Space: Space allocated to a particular line of business (LOB), such as Finance or Marketing
 User Space: Dedicated space for individual users for development / experimentation
 Enterprise Space: This layer stores all datasets (raw or processed) used by multiple
business groups
Implementation topics:
 HDFS Resource Management
• Storage Structure
• HDFS Quota Management
• HDFS Resource Isolation
 SQL-On-Hadoop Data Management
 Compute
 Security
 Governance

Storage Structure
The structure of HDFS storage should be simple and clearly
isolate the different business units’ raw data.
e.g. Storage Structure
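The three storage categories (LOB space, user space, enterprise space) could map to a directory layout along the following lines. This is only a sketch: the root paths `/data/lob`, `/user` and `/data/enterprise`, the layer names and the tenant names are hypothetical and are not taken from the original slide.

```python
from posixpath import join

# Hypothetical root directories for the three storage categories;
# the actual layout is a design decision for each data lake.
LOB_ROOT = "/data/lob"
USER_ROOT = "/user"
ENTERPRISE_ROOT = "/data/enterprise"


def lob_path(lob: str, layer: str = "raw") -> str:
    """Directory holding one line of business's data for a given layer."""
    return join(LOB_ROOT, lob.lower(), layer)


def user_path(username: str) -> str:
    """Private space for an individual user's development work."""
    return join(USER_ROOT, username.lower())


def enterprise_path(dataset: str, layer: str = "processed") -> str:
    """Shared dataset used by multiple business groups."""
    return join(ENTERPRISE_ROOT, layer, dataset.lower())


def storage_layout(lobs, users, datasets):
    """Return the sorted list of directories to create for the given tenants."""
    paths = [lob_path(lob) for lob in lobs]
    paths += [user_path(user) for user in users]
    paths += [enterprise_path(dataset) for dataset in datasets]
    return sorted(paths)
```

Keeping each business unit's raw data under its own top-level directory means that quotas and permissions can later be applied per directory.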

HDFS Quota Management
HDFS supports two quota mechanisms that administrators
can utilize to manage space usage by cluster tenants:
Disk Space Quotas
 Sets disk space limits on a per-directory basis
 Prevents users from accidentally or maliciously
consuming excess disk space within the cluster
Name Space Quotas
 Limits the number of files or subdirectories within a
particular directory
 Helps administrators optimize the metadata subsystem
(NameNode) within the Hadoop cluster
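Both quota types are set with the `hdfs dfsadmin` command. The sketch below only builds the command lines and does not run them; the tenant directories and limits are hypothetical. Note that a disk space quota counts every replica of a block, so with a replication factor of 3 a 1 GB file consumes 3 GB of quota.

```python
def space_quota_cmd(directory: str, quota: str) -> list:
    """Build the command that sets a disk space quota (e.g. '10t') on a directory."""
    return ["hdfs", "dfsadmin", "-setSpaceQuota", quota, directory]


def name_quota_cmd(directory: str, max_names: int) -> list:
    """Build the command that limits the number of files and subdirectories."""
    return ["hdfs", "dfsadmin", "-setQuota", str(max_names), directory]


def raw_bytes_consumed(file_bytes: int, replication: int = 3) -> int:
    """Space quotas count every replica, so a file consumes size x replication."""
    return file_bytes * replication


# Hypothetical tenant directories and limits, for illustration only.
TENANT_QUOTAS = {
    "/data/lob/finance": {"space": "10t", "names": 1_000_000},
    "/data/lob/marketing": {"space": "5t", "names": 500_000},
}


def quota_commands(tenant_quotas: dict) -> list:
    """Return every quota command needed for the given tenant directories."""
    commands = []
    for directory, limits in sorted(tenant_quotas.items()):
        commands.append(space_quota_cmd(directory, limits["space"]))
        commands.append(name_quota_cmd(directory, limits["names"]))
    return commands
```

Current usage against a quota can be checked with `hdfs dfs -count -q <directory>`.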

HDFS Resource Isolation
 HDFS resource isolation is achieved using HDFS federation (available since Hadoop 2.x)
 HDFS federation uses multiple independent NameNodes / namespaces to scale the name service
horizontally
 All DataNodes are used as common block storage by all the NameNodes. Each DataNode registers
with every NameNode in the cluster
Limitations of a Single Namespace / NameNode
 Namespace and block storage are tightly coupled
 The namespace does not scale the way storage does: adding DataNodes scales storage
horizontally, but the namespace stays bound to a single NameNode
 File system operations are limited by the throughput of that single NameNode
 There is no separation of namespaces, so the tenant organizations using the cluster are not
isolated from one another
Benefits of HDFS Federation
 Isolation: A single NameNode offers no isolation in a multi-user environment. With multiple
NameNodes, different categories of applications and users can be placed in separate namespaces
 Scalability: Federation scales the file system namespace horizontally by adding NameNodes
 Performance: Read / write throughput can be improved by adding more NameNodes
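A federated cluster lists its nameservices in `dfs.nameservices` and gives each one its own NameNode RPC address in `hdfs-site.xml`. The sketch below generates those two kinds of property for one namespace per tenant; the nameservice IDs, host names and port are hypothetical, and a real deployment needs further per-NameNode settings.

```python
def federation_properties(namenodes: dict, rpc_port: int = 8020) -> dict:
    """Build hdfs-site.xml properties for one NameNode per namespace.

    `namenodes` maps a nameservice ID to the host running its NameNode.
    Host names and the port are illustrative assumptions.
    """
    props = {"dfs.nameservices": ",".join(sorted(namenodes))}
    for nameservice, host in sorted(namenodes.items()):
        # Each nameservice gets its own RPC endpoint, hence its own namespace.
        props[f"dfs.namenode.rpc-address.{nameservice}"] = f"{host}:{rpc_port}"
    return props


# Hypothetical tenants, each served by its own NameNode.
EXAMPLE = federation_properties(
    {"finance": "nn1.example.com", "marketing": "nn2.example.com"}
)
```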

SQL-On-Hadoop Data Management
There are three different ways to manage multi-tenant data in a SQL-on-Hadoop database. The data
is ultimately stored on HDFS but can be queried using SQL tools such as Hive, HAWQ and Impala.
 Separated Databases: Storing tenant data in separate databases is the simplest approach to
data isolation
 Shared Database and Separate Schema: This approach houses multiple tenants in the same
database. Each tenant has its own set of tables, grouped into a schema created specifically for
that tenant
 Shared Database and Schema: The same database and the same set of tables host multiple
tenants' data. Each table holds records from multiple tenants, segregated by a tenant ID column
[Figure: isolation spectrum from Isolated to Shared: Separated Databases, Separate Schema, Shared
Schema]
Implementing Shared Database and Separate Schema is recommended.
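The difference between the three approaches shows up in how a query addresses a tenant's table. The sketch below is illustrative only: the database name `edl`, the schema name `shared`, the `_db` suffix and the `tenant_id` column are hypothetical naming choices, not part of the original guidance.

```python
from typing import NamedTuple, Optional


class TableRef(NamedTuple):
    qualified_name: str         # name to use in the FROM clause
    row_filter: Optional[str]   # predicate needed to isolate one tenant, if any


def table_ref(model: str, tenant: str, table: str) -> TableRef:
    """Return how one tenant's table is addressed under each isolation model."""
    # Reject anything that is not a plain identifier before building SQL text.
    if not (tenant.isidentifier() and table.isidentifier()):
        raise ValueError("tenant and table must be plain identifiers")
    if model == "separate_database":
        return TableRef(f"{tenant}_db.{table}", None)
    if model == "separate_schema":
        return TableRef(f"edl.{tenant}.{table}", None)
    if model == "shared_schema":
        # Every query must carry the tenant filter; isolation depends on it.
        return TableRef(f"edl.shared.{table}", f"tenant_id = '{tenant}'")
    raise ValueError(f"unknown isolation model: {model}")
```

Under the shared schema model, isolation depends on every query carrying the tenant filter, whereas the other two models isolate tenants by name.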

Compute
 The YARN Capacity Scheduler allocates shared cluster resources among users and groups
 The queue is the key scheduling component in YARN and the unit of resource isolation.
Important queue properties are:
• Queue name
• Queue path name
• Associated child queues and applications
• Minimum and maximum capacity of the queue
 Resources can be allocated with the Capacity Scheduler by:
• Enabling the Capacity Scheduler
• Setting up queues
• Controlling access to queues with ACLs (Access Control Lists)
e.g. Setting up a queue hierarchy: a Root queue with the child queues Finance, Marketing and
Operations
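The three steps (enabling the scheduler, setting up queues, controlling access with ACLs) correspond to a small set of configuration properties. The sketch below generates them for a queue hierarchy like the one in the example; the capacity percentages and the account and group names are hypothetical.

```python
SCHEDULER_CLASS = (
    "org.apache.hadoop.yarn.server.resourcemanager."
    "scheduler.capacity.CapacityScheduler"
)


def capacity_scheduler_properties(queues: dict) -> dict:
    """Build scheduler properties for child queues directly under root.

    `queues` maps a queue name to a dict with `capacity` (guaranteed share, %),
    `max_capacity` (upper limit, %) and `submit_acl` (users, a space, then groups).
    """
    total = sum(q["capacity"] for q in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")

    prefix = "yarn.scheduler.capacity.root"
    props = {
        # Set in yarn-site.xml to enable the Capacity Scheduler.
        "yarn.resourcemanager.scheduler.class": SCHEDULER_CLASS,
        # The remaining properties belong in capacity-scheduler.xml.
        f"{prefix}.queues": ",".join(queues),
    }
    for name, q in queues.items():
        props[f"{prefix}.{name}.capacity"] = str(q["capacity"])
        props[f"{prefix}.{name}.maximum-capacity"] = str(q["max_capacity"])
        props[f"{prefix}.{name}.acl_submit_applications"] = q["submit_acl"]
    return props


# Hypothetical shares and service accounts for the three example queues.
EXAMPLE_QUEUES = {
    "finance": {"capacity": 40, "max_capacity": 60, "submit_acl": "finance_svc finance_grp"},
    "marketing": {"capacity": 30, "max_capacity": 50, "submit_acl": "marketing_svc marketing_grp"},
    "operations": {"capacity": 30, "max_capacity": 50, "submit_acl": "ops_svc ops_grp"},
}
```

The `capacity` value is the guaranteed minimum for a tenant group, while `maximum-capacity` caps how far a queue can grow into idle resources. Queue ACLs are inherited from the parent queue, so the root queue's ACL typically also needs restricting for the per-queue ACLs to take effect.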

Security
Authentication
 Need to know who the users are and which groups they belong to
 Kerberos is the only authentication method supported by most components
Authorization
 Need to know what users can access and what level of permissions they have
 In a multi-tenant environment, service accounts / IDs need to be set up at the group and
enterprise levels
 Service accounts exist in the enterprise identity store, and access is granted to end users
through Apache Ranger / Sentry
Auditing
 Determines who did what and when
 Apache Ranger / Sentry or a custom solution can be used for auditing
Data Protection
 Data in Transit
• SSL / TLS needs to be enabled to encrypt data between clients and service endpoints
• Keys / certificates are configured per service / role
 Data at Rest
• Multiple encryption zones on HDFS allow only authorized users to access data
• Data is also transmitted in encrypted form, as encryption is applied at the HDFS block level
• Keys can be stored in a Java keystore or an HSM
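Creating one encryption zone per tenant takes three commands: create a key, create an empty directory, and turn that directory into a zone. The sketch below only builds the command lines; the key naming scheme and the `/data/lob/<tenant>/secure` path are hypothetical.

```python
def encryption_zone_commands(tenant: str, base_dir: str = "/data/lob") -> list:
    """Build the commands that create one HDFS encryption zone for a tenant.

    The key name and zone path follow an illustrative naming convention.
    """
    key_name = f"{tenant}_key"
    zone_path = f"{base_dir}/{tenant}/secure"
    return [
        # 1. Create the zone key in the configured key provider.
        ["hadoop", "key", "create", key_name],
        # 2. Create the directory; it must be empty to become a zone.
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        # 3. Turn the directory into an encryption zone using that key.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
    ]
```

Giving each tenant its own key means that access to one tenant's key does not expose another tenant's data.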

Governance
Apache Atlas can be used to manage the metadata of the data assets and to keep track of their
lineage information.

On-Premise Reference Architecture
[Figure: on-premise data lake reference architecture, with the following elements]
 Tenants: Operations, Data Science, BI & DWH, Marketing
 Data Access: Impala, Drill, Presto, NoSQL, Solr, SQL, Streaming, MLlib, GraphX, Spark, Hive, Pig,
Cascading, Tez, Giraph, Mahout, MapReduce
 YARN: Data Operating System, with the YARN Capacity Scheduler
 HDFS (Hadoop Distributed File System):
• 1. HDFS Quota Management (Disk Space Quota, Name Space Quota)
• 2. HDFS Federation
 Cross-cutting layers: Governance, Security & Operation, Data Management

On-Cloud Design Considerations
Infrastructure Isolation (Physical & Logical)
 Infrastructure is shared physically or logically, based on security expectations (multi-tenant
environment)
 Complete isolation of compute using a dedicated cluster or resource pool
 Logical isolation at the virtual machine level
 Complete network isolation through a dedicated network for every tenant, with VLANs for
logical isolation
Virtualization Using Multiple VM Support
 Infrastructure components are virtualized and managed as a single entity, and are
simultaneously isolated based on tenancy
 Security and compliance needs are met through anti-collocation, hypervisor-level firewalls,
resource grouping of compute and storage, and VLAN-based isolation
Cloud Automation & Integration
 Essential to support client-specific business processes, identity management and integration
with tools and services
 In a multitenant cloud environment, it is essential to define standard processes and
practices for providing customization flexibility
Catalogue Manager
 Each aspect of multi-tenancy must culminate in an intuitive user interface
 Catalogue content depends on tenancy and individual privileges
 Multi-tenancy at a catalogue level may authorize integration with different directory
services for each tenant

On-Cloud Implementation Guidelines
Key guidelines to achieve a multitenant data lake environment:
Storage
Object stores are the key storage units in a cloud data lake. The object storage services of the
key cloud providers are:
Cloud Provider   AWS                                Azure                   Google
Service Name     S3                                 Azure Storage (Blob)    Google Cloud Storage
Hot              S3 Standard                        Hot Blob Storage        GCS
Cool             S3 Standard - Infrequent Access    Cool Blob Storage       GCS Nearline
Archival         Glacier                            Archive Blob Storage    GCS Coldline
Object Limit     Unlimited                          Unlimited               Unlimited
Size Limit       5 TB / Object                      500 TB / Account        5 TB / Object
Storage Management
Multitenancy can be achieved using two methods:
 Storage Account: Storage can be isolated by using a separate storage account for each tenant
group, along with appropriate identity and access management
 Containers / Buckets: Storage can be isolated by creating a separate container / bucket for
each tenant group, with appropriate identity and access management
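On AWS, the container / bucket method could pair one bucket per tenant group with an IAM policy that grants access to that bucket only. The sketch below builds such a policy document as plain JSON; the `edl-` bucket prefix and the chosen set of actions are hypothetical, and a real policy would be tailored to each tenant's needs.

```python
import json


def tenant_bucket_name(tenant: str, prefix: str = "edl") -> str:
    """Hypothetical naming convention: one bucket per tenant group."""
    return f"{prefix}-{tenant.lower()}"


def tenant_access_policy(tenant: str) -> dict:
    """IAM identity policy limiting a tenant group to its own bucket."""
    bucket = tenant_bucket_name(tenant)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListTenantBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                "Sid": "ReadWriteTenantObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


def policy_json(tenant: str) -> str:
    """Serialize the policy for attaching to the tenant group's role or group."""
    return json.dumps(tenant_access_policy(tenant), indent=2)
```

Because the policy names only the tenant's own bucket, users in that group get no access to other tenants' buckets unless another policy grants it.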

Resource Group
The best approach is to create a separate resource group for each LOB tenant
Big Data Processing
Big data processing services provided by the key cloud providers:
                AWS                                         Azure                           Google
Service Name    Elastic MapReduce (EMR) | AWS Databricks    HDInsight | Azure Databricks    Google Cloud Dataproc
Data Processing Isolation
 Create a separate service for each LOB within the same subscription, e.g., the operations
department has its own Azure / AWS Databricks service within a given subscription
 Deploy separate clusters for each tenant group or user, with appropriate identity and
access management
 Data processing results can be stored in the tenant-specific storage accounts / buckets

Security, Identity and Access
Cloud service providers are responsible for the security of their data centers
Application Security Implementation
 Configure AWS / Azure Stack so that users from multiple IAM / Azure Active Directory
(Azure AD) tenants can use its services
 Provide authorization using AWS Organizations / Azure RBAC
 Encryption keys can be managed using AWS Key Management Service / Azure Key Vault
Security Features from Cloud Providers
Authentication & Authorization
• AWS: Identity and Access Management (IAM), AWS Organizations
• Azure: Azure Active Directory, Azure RBAC
• Google: Google Cloud Identity and Access Management
Encryption
• AWS: Server-side Encryption with Amazon S3 Key Management Service, Key Management Service
• Azure: Azure Storage Service Encryption, Key Vault
• Google: Google / Customer Managed Encryption Key

Network Services
Multi-tenant Network Strategy
 Isolate the application servers on their own physical network. This approach works for single
tenants on dedicated servers
 Define Virtual Switches (vSwitches) for each tenant. A vSwitch can bring all of a tenant's
virtual machines together on one logical switch. Like virtual machines, vSwitches can move within
the cloud environment
 Configure Virtual Local Area Networks (VLANs) to create a separate network for each tenant
Network Services from Cloud Providers
                              AWS                            Azure                Google
Cloud Virtual Network         Virtual Private Cloud (VPC)    Virtual Network      Virtual Private Cloud (VPC)
Cross Premise Connectivity    AWS VPN Gateway                Azure VPN Gateway    Google VPN Gateway
Dedicated Network             Direct Connect                 ExpressRoute         Dedicated Interconnect
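Whether tenant networks are built as VLANs or as cloud virtual networks, one simple way to keep them separate is to give each tenant its own non-overlapping address range. The sketch below carves per-tenant subnets out of a larger block using Python's standard `ipaddress` module; the `10.0.0.0/16` block, the `/24` subnet size and the tenant names are hypothetical.

```python
import ipaddress


def tenant_subnets(tenants, supernet="10.0.0.0/16", new_prefix=24):
    """Assign each tenant the next free subnet carved out of `supernet`."""
    network = ipaddress.ip_network(supernet)
    subnets = network.subnets(new_prefix=new_prefix)
    allocation = {}
    for tenant in tenants:
        try:
            allocation[tenant] = next(subnets)
        except StopIteration:
            raise ValueError("address space exhausted") from None
    return allocation


# Hypothetical tenants, for illustration only.
ALLOCATION = tenant_subnets(["finance", "marketing", "operations"])
```

Because the subnets come from a single iterator over the parent block, no two tenants can be assigned overlapping ranges.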

On-Cloud Reference Architecture
[Figure: on-cloud data lake reference architecture, with the following elements]
 Tenants: Customer A, Customer B, Customer C, Customer D
 Load Balancer in front of VMs hosting Instance 1, Instance 2, ..., Instance n-1, Instance n
 Data Access: SQL, Streaming, MLlib, GraphX, Databricks Spark, Redshift, Azure SQL DWH, Elastic DWH,
Hive, HBase, Spark, Hadoop, HDInsight / EMR, IaaS, Impala, Drill, Presto, NoSQL, Solr
 YARN: Data Operating System
 Cloud Data Storage (S3 / Azure Blob, ADLS / GCS)
 Cross-cutting layers: Governance, Security & Operation, Data Management


Key Words
 Enterprise Data Lake
 Multitenancy
 HDFS Resource Management
 HDFS Resource Isolation
 HDFS Quota Management

Thank You
Authors: Sanjay Upadhyay
thoughtleaders@citiustech.com
About CitiusTech
 3,200+ healthcare IT professionals worldwide
 100% healthcare industry focus
 30%+ CAGR over last 5 years
 110+ healthcare customers
• Healthcare technology companies
• Hospitals, IDNs & medical groups
• Payers and health plans
• ACOs, MCOs, HIEs, HIXs, NHINs
• Pharma & Life Sciences companies
