This document covers guidelines for achieving multitenancy in a data lake environment. It describes the design and implementation guidelines for on-premise as well as cloud-based multitenant data lakes, and highlights the reference architecture for both deployment options.
Data Lake - Multitenancy Best Practices
This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information
contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
30 November, 2018 | Author: Sanjay Upadhyay; Sr. Solution Architect
CitiusTech Thought Leadership
Objective
Multitenancy in a data lake allows organizations to share their cluster resources across user
communities without impacting business SLAs and capabilities or security and privacy needs.
This document covers guidelines for achieving multitenancy in a data lake environment.
It describes the design and implementation guidelines for on-premise as well as cloud-based
multitenant data lakes, and highlights the reference architecture for both deployment options.
Agenda
Introduction
Key Drivers for Data Lake Multitenancy
On-Premise Data Lake Multitenancy
• Design Considerations
• Implementation Guidelines
• Reference Architecture
On-Cloud Data Lake Multitenancy
• Design Considerations
• Implementation Guidelines
• Reference Architecture
Introduction
In modern data management infrastructure, a data lake is an important repository that holds a vast
amount of data in its native format until it is needed. An Enterprise Data Lake (EDL) can not only
store and share enterprise-wide information but can also run a variety of enterprise workloads
such as batch processing, stream processing, interactive SQL, enterprise search and advanced
analytics.
The adoption of data lakes is increasing, and organizations are leveraging data lakes for numerous
use cases. As usage increases, operational challenges such as resource contention also increase,
causing critical business processes to fail and impacting service level agreements (SLAs).
Earlier, most data lake implementations were on-premise. However, organizations today are
leveraging cloud technology to replace or expand their existing data lake implementations.
Data lake multitenancy is an important architectural paradigm that enables multiple business
users and processes to share a common set of resources, such as Apache Hadoop clusters.
This includes setting up appropriate policies around resource provisioning and access while
meeting SLAs and security requirements for each tenant.
The guidelines for achieving multitenancy differ between on-premise and on-cloud deployments.
Some of the key design principles and implementation guidelines are described in this document.
Key Drivers for Data Lake Multitenancy
Resource and Cost Optimization: Having multiple business units share the same cluster resources
brings significant cost savings (across hardware and operational costs)
Collaboration and Decision Making: Multitenancy enables data sharing and collaboration
between different teams, and integrates multiple data silos to give a unified view of data
Operational Simplicity: Sharing a cluster among multiple users makes it much easier to
deploy and manage
Wide Audience: A multi-tenant, cloud-based cluster enables a wide range of stakeholders
(developers, analysts, data scientists) from different organizational units to access and use the
data
Seamless Scalability: A multi-tenant model makes it easy for IT teams to scale out clusters to
meet changing business needs (e.g., business growth, new markets, reorganizations, mergers and
acquisitions)
On-Premise Design Considerations
Key design considerations to achieve data lake multitenancy:
Defining the Tenant: Identify the business units that will access the cluster. All business units
should have defined use cases for leveraging the enterprise data lake
Selecting the Right Isolation Model: Design the cluster using one of three key architectural
models: Share Nothing, Shared Management or Shared Resources
Defining the Resource Usage Agreement: Ensure complete coverage of utilization and SLA
requirements across storage, compute, data governance, tracking, auditing and on-boarding
On-Premise Design: Isolation Models
Model 1: Share Nothing
Cluster management and data are segregated
Does not leverage multitenancy, but can be used by IT
teams based on operational realities and governance
policies
Model 2: Shared Management
Cluster management is shared, while data and resources
are separated for tenant groups
Useful for high-priority clusters that can’t afford to risk any
performance issues or resource contention
Model 3: Shared Resources
Leverages the multitenancy benefits of consolidated
cluster management, shared data and shared resources
This isolation model is recommended for the development of
enterprise data lakes
[Figure: isolation model diagrams showing Tenant A and Tenant B under a single shared Cluster Manager, and Tenant A and Tenant B each under a separate Cluster Manager]
On-Premise Design: Resource Usage Agreement
Key Components
Storage
• Every tenant group on the cluster should have access to its own section (namespace
/ directory)
• A dedicated directory should be assigned to users in a tenant group to store
data
• The storage quota needs to be controlled so that the work of other groups and
users isn’t impacted by an influx of large data
Compute
• All tenant groups should be guaranteed an agreed-upon minimum compute
power at all times
• It is recommended to provide compute power at the tenant group level, irrespective of
the tenant users
Data Governance
• Define metadata management and data lineage
Tracking & Auditing
• Track cluster access and generate reports
• Audit the data assets accessed, along with other metadata such as access time and IP address
On-boarding
• Design the hierarchy of the tenant groups and service accounts
• The process of adding tenant groups and users to the cluster should be
straightforward
On-Premise Implementation Guidelines (1/8)
Hadoop Distributed File System (HDFS) Resource Management
HDFS is the key storage unit in the data lake and is shared by all
the tenant groups, users and service accounts for processing
jobs. HDFS storage is broadly classified into three categories:
LOB Space: Space allocated to a particular line of business (LOB)
such as Finance or Marketing
User Space: Dedicated space for individual users for
development / experimentation
Enterprise Space: This layer stores all datasets (raw or
processed) used by multiple business groups
Implementation guideline topics:
• HDFS Resource Management
• Storage Structure
• HDFS Quota Management
• HDFS Resource Isolation
• SQL-On-Hadoop Data Management
• Compute
• Security
• Governance
On-Premise Implementation Guidelines (2/8)
Storage Structure
The structure of HDFS storage should be simple and clearly
isolate the different business units’ raw data.
[Figure: example storage structure]
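A minimal sketch of one possible layout for the three storage categories (LOB space, user space, enterprise space). The root directories `/data/lob`, `/user` and `/data/enterprise`, the `raw` / `processed` zones, and the tenant and user names are all assumptions made for illustration; the source does not prescribe specific paths.

```python
from pathlib import PurePosixPath

# Hypothetical root directories; the source does not name actual paths.
LOB_ROOT = PurePosixPath("/data/lob")
USER_ROOT = PurePosixPath("/user")
ENTERPRISE_ROOT = PurePosixPath("/data/enterprise")


def tenant_layout(lobs, users, zones=("raw", "processed")):
    """Return the HDFS directories for LOB, user and enterprise space."""
    paths = []
    for lob in lobs:
        # Each line of business gets its own area per zone,
        # so raw data of different business units never shares a directory.
        for zone in zones:
            paths.append(LOB_ROOT / lob / zone)
    for user in users:
        # One sandbox directory per individual user.
        paths.append(USER_ROOT / user)
    for zone in zones:
        # Datasets shared by multiple business groups.
        paths.append(ENTERPRISE_ROOT / zone)
    return [str(path) for path in paths]


layout = tenant_layout(["finance", "marketing"], ["alice"])
for path in layout:
    print(path)
```

Keeping the tenant name at a fixed depth in the path makes it straightforward to attach quotas and permissions to one directory per tenant.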
On-Premise Implementation Guidelines (3/8)
HDFS Quota Management
HDFS supports two quota mechanisms that administrators
can utilize to manage space usage by cluster tenants:
Disk Space Quotas
Sets disk space limits on a per-directory basis
Prevents users from accidentally or maliciously
consuming excess disk space within the cluster
Name Space Quotas
Limits the number of files or subdirectories within a
particular directory
Helps administrators optimize the metadata subsystem
(NameNode) within the Hadoop cluster
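The two quota mechanisms above map to the `hdfs dfsadmin -setSpaceQuota` and `hdfs dfsadmin -setQuota` commands. The sketch below only builds the argument lists and does not run them; the directory and the quota values are hypothetical.

```python
def quota_commands(directory, space_quota=None, name_quota=None):
    """Build hdfs dfsadmin argument lists for one tenant directory."""
    commands = []
    if space_quota is not None:
        # Disk space quota, e.g. "10t" for 10 terabytes.
        commands.append(
            ["hdfs", "dfsadmin", "-setSpaceQuota", space_quota, directory]
        )
    if name_quota is not None:
        # Name quota: maximum number of files and directories in the tree.
        commands.append(
            ["hdfs", "dfsadmin", "-setQuota", str(name_quota), directory]
        )
    return commands


# Hypothetical tenant directory and limits.
cmds = quota_commands("/data/lob/finance", space_quota="10t", name_quota=1_000_000)
for cmd in cmds:
    print(" ".join(cmd))
```

Note that the disk space quota counts every replica of a block, so the usable capacity depends on the replication factor. Current usage against both quotas can be checked with `hdfs dfs -count -q <directory>`.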
On-Premise Implementation Guidelines (4/8)
HDFS Resource Isolation
HDFS resource isolation is achieved using HDFS federation (HDFS 2.x)
HDFS federation uses multiple independent namenodes /
namespaces to scale the name service horizontally
All data nodes are used as common block storage by all the
namenodes. Each data node registers with all the namenodes in the cluster
Limitations of a Single Namespace / Namenode
Namespace and block storage are tightly coupled
The namespace does not scale the way data nodes do. An HDFS cluster scales storage
horizontally by adding data nodes, but the namespace remains on a single namenode
File system throughput is limited by the throughput of a single
namenode
There is no separation of namespaces, so tenant organizations
sharing the cluster are not isolated from each other
Benefits of HDFS Federation
Isolation: A single namenode offers no isolation in a multi-user environment. With
multiple namenodes, different categories of applications and users can be placed in
separate namespaces
Scalability: The file system namespace scales horizontally as
namenodes are added
Performance: Read / write throughput can be improved by adding more namenodes
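As a rough illustration of federation, each tenant namespace can be given its own nameservice in `hdfs-site.xml` through the `dfs.nameservices` and `dfs.namenode.rpc-address.<nameservice>` properties. The sketch below generates those properties; the nameservice IDs, host names and port are hypothetical.

```python
def federation_properties(nameservices, rpc_port=8020):
    """Map nameservice IDs to hdfs-site.xml properties for HDFS federation.

    nameservices: dict of nameservice ID -> namenode host name.
    """
    props = {"dfs.nameservices": ",".join(nameservices)}
    for ns_id, host in nameservices.items():
        # One independent namenode (and namespace) per nameservice.
        props[f"dfs.namenode.rpc-address.{ns_id}"] = f"{host}:{rpc_port}"
    return props


# Hypothetical nameservices, one per tenant group.
props = federation_properties(
    {"ns-finance": "nn1.example.com", "ns-marketing": "nn2.example.com"}
)
for name, value in props.items():
    print(f"{name} = {value}")
```

All data nodes would share this configuration so that each one registers with every namenode listed.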
On-Premise Implementation Guidelines (5/8)
SQL-On-Hadoop Data Management
There are three different ways to manage multi-tenant data in a SQL-
on-Hadoop database. The data is ultimately stored on HDFS but can
be queried using SQL tools such as Hive, HAWQ and Impala.
Separated Databases: Storing tenant data in separate databases
is the simplest approach to data isolation
Shared Database and Separate Schema: This approach involves
housing multiple tenants in the same database. Each tenant has
its own set of tables that are grouped into a schema created
specifically for that tenant
Shared Database and Schema: The same database and the same set
of tables host multiple tenants' data. Each table has records from
multiple tenants, segregated by the value of a tenant ID column
[Figure: isolation spectrum running from Isolated (Separated Databases) through Separate Schema to Shared (Shared Schema)]
Implementing Shared Database and Separate Schema is recommended.
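To make the trade-off concrete, here is a small sketch of the third approach (Shared Database and Schema), where rows are segregated only by a tenant ID column. It uses Python's built-in `sqlite3` purely as a stand-in for a SQL-on-Hadoop engine; the `orders` table, its columns and the tenant names are hypothetical.

```python
import sqlite3

# In-memory database standing in for a shared SQL-on-Hadoop database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (tenant_id TEXT NOT NULL, order_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("finance", 1, 120.0), ("marketing", 2, 75.5), ("finance", 3, 40.0)],
)


def rows_for_tenant(connection, tenant_id):
    """Return only the rows that belong to one tenant."""
    cursor = connection.execute(
        "SELECT order_id, amount FROM orders WHERE tenant_id = ? ORDER BY order_id",
        (tenant_id,),
    )
    return cursor.fetchall()


finance_rows = rows_for_tenant(conn, "finance")
print(finance_rows)
```

Every query in this model has to carry the tenant filter, which is one reason a separate schema per tenant is easier to secure: access can be granted on whole schemas instead of being enforced row by row.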
On-Premise Implementation Guidelines (6/8)
Compute
YARN Capacity Scheduler is used as a resource
management application to allocate shared cluster
resources among users and groups
The queue is an important component of scheduling in
YARN and of resource isolation. Important queue
properties are:
• Queue name
• Queue path name
• Associated child queues and applications
• Minimum and maximum capacity of the queue
Resources can be allocated with the Capacity Scheduler by:
• Enabling the Capacity Scheduler
• Setting up queues
• Controlling access to queues with ACLs (Access Control Lists)
e.g. Setting up a queue hierarchy: Root is the parent queue, with Finance, Marketing and Operations as child queues
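The three steps above can be sketched as configuration properties: `yarn.resourcemanager.scheduler.class` in `yarn-site.xml` enables the Capacity Scheduler, and the `yarn.scheduler.capacity.root.*` properties in `capacity-scheduler.xml` define the queues, their capacities and their submit ACLs. The capacity percentages and group names below are assumptions for illustration only.

```python
SCHEDULER_CLASS = (
    "org.apache.hadoop.yarn.server.resourcemanager."
    "scheduler.capacity.CapacityScheduler"
)

# Step 1: enable the Capacity Scheduler (yarn-site.xml).
yarn_site = {"yarn.resourcemanager.scheduler.class": SCHEDULER_CLASS}


def capacity_scheduler_properties(queues):
    """Build capacity-scheduler.xml properties for child queues of root.

    queues: dict of queue name -> (capacity %, maximum capacity %, submit group).
    """
    total = sum(capacity for capacity, _, _ in queues.values())
    if total != 100:
        raise ValueError(f"child queue capacities must sum to 100, got {total}")
    prefix = "yarn.scheduler.capacity.root"
    # Step 2: declare the child queues under root.
    props = {f"{prefix}.queues": ",".join(queues)}
    for name, (capacity, maximum, group) in queues.items():
        props[f"{prefix}.{name}.capacity"] = str(capacity)
        props[f"{prefix}.{name}.maximum-capacity"] = str(maximum)
        # Step 3: ACL format is "users groups"; a leading space means groups only.
        props[f"{prefix}.{name}.acl_submit_applications"] = f" {group}"
    return props


# Hypothetical capacities and groups for the Finance / Marketing / Operations queues.
props = capacity_scheduler_properties(
    {
        "finance": (50, 70, "finance_grp"),
        "marketing": (30, 50, "marketing_grp"),
        "operations": (20, 40, "operations_grp"),
    }
)
for name, value in props.items():
    print(f"{name} = {value!r}")
```

The `capacity` value is the guaranteed minimum share, while `maximum-capacity` caps how far a queue can grow into idle resources. Queue ACLs are combined with those of the parent queue, so the root queue's ACL usually has to be restricted as well for the per-tenant ACLs to take effect.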
On-Premise Implementation Guidelines (7/8)
Security
Authentication
• Need to know who the users are and which groups they belong to
• Kerberos is the only authentication method supported by
most components
Authorization
• Need to know what users can access and what level of
permissions they have
• In a multi-tenant environment, service accounts / IDs
need to be set up at the group and enterprise levels
• Service accounts exist in the enterprise identity store, and
are provided to end users using Apache Ranger / Sentry
Auditing
• Determines who did what and when
• Apache Ranger / Sentry or a custom solution can be used for auditing
Data Protection
Data in Transit
• SSL / TLS needs to be enabled to encrypt data
between clients and service endpoints
• Keys / certificates are configured per service / role
Data at Rest
• Multiple encryption zones on HDFS allow only
authorized users to access data
• Data is transmitted in encrypted form because encryption
is applied at the HDFS block level
• Keys can be stored in a Java keystore or an HSM
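For data at rest, one encryption zone per tenant can be created with the `hadoop key create` and `hdfs crypto -createZone` commands. The sketch below only assembles the command lines and does not execute them; the tenant name, the `<tenant>-key` naming scheme and the zone path are hypothetical.

```python
def encryption_zone_commands(tenant, zone_path):
    """Build the commands that create one HDFS encryption zone for a tenant."""
    key_name = f"{tenant}-key"  # hypothetical key naming scheme
    return [
        # Create the zone key in the configured key provider (KMS).
        ["hadoop", "key", "create", key_name],
        # The zone directory must exist and be empty before the zone is created.
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        # Turn the directory into an encryption zone protected by the tenant key.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
    ]


zone_cmds = encryption_zone_commands("finance", "/data/lob/finance/secure")
for cmd in zone_cmds:
    print(" ".join(cmd))
```

Using a separate key per tenant means that access to one tenant's key does not expose another tenant's zone.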
On-Premise Implementation Guidelines (8/8)
Governance
Apache Atlas can be used to manage the metadata of the data
assets and to keep track of lineage information.
On-Premise Reference Architecture
[Figure: on-premise reference architecture, described by layer]
Tenants: Operations, Data Science, BI & DWH, Marketing
Data Lake:
• Data Access: Impala, Drill, Presto, NoSQL, Solr, SQL, Streaming, MLlib, GraphX, Spark, Hive, Pig, Cascading, Tez, Giraph, Mahout, MapReduce
• Data Management
• YARN: Data Operating System, with the YARN Capacity Scheduler
• HDFS (Hadoop Distributed File System), with:
1. HDFS Quota Management (Disk Space Quota, Name Space Quota)
2. HDFS Federation
• Governance
• Security & Operation
On-Cloud Design Considerations
Infrastructure Isolation (Physical & Logical)
• Infrastructure is shared physically or logically based on security expectations (multi-tenant environment)
• Complete isolation of compute using a dedicated cluster or resource pool
• Logical isolation at the virtual machine level
• Complete network isolation through a dedicated network for every tenant, and VLANs for logical isolation
Virtualization Using Multiple VM Support
• Infrastructure components are virtualized and managed as a single entity, and are simultaneously isolated based on tenancy
• Security and compliance needs are met through anti-collocation, hypervisor-level firewalls, resource grouping of compute and storage, and VLAN-based isolation
Cloud Automation & Integration
• Essential to support client-specific business processes, identity management and integration with tools and services
• In a multitenant cloud environment, it is essential to define standard processes and practices for providing customization flexibility
Catalogue Manager
• Each aspect of multitenancy must culminate in an intuitive user interface
• Catalogue content depends on tenancy and individual privileges
• Multitenancy at the catalogue level may allow integration with a different directory service for each tenant
On-Cloud Implementation Guideline (1/4)
Key guidelines to achieve multitenant data lake environment:
Storage
Object stores are the key storage units in a cloud data lake. Storage services from some of the key cloud
providers:
Cloud Provider | AWS | Azure | Google
Service Name | S3 | Azure Storage (Blob) | Google Cloud Storage
Hot | S3 Standard | Hot Blob Storage | GCS
Cool | S3 Standard - Infrequent Access | Cool Blob Storage | GCS Nearline
Archival | Glacier | Archive Blob Storage | GCS Coldline
Object Limit | Unlimited | Unlimited | Unlimited
Size Limit | 5 TB / Object | 500 TB / Account | 5 TB / Object
Storage Management
Multitenancy can be achieved using two methods:
Storage Account: Isolation of storage can be implemented using a separate storage account for each
tenant group along with appropriate identity and access management
Containers / Buckets: By creating separate containers / buckets for each tenant group with
appropriate identity and access management
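As an illustration of the Containers / Buckets method on AWS, the sketch below builds an IAM policy document that limits a tenant group to its own S3 bucket. The bucket name and the statement IDs are hypothetical, and the set of allowed actions is an assumption that a real deployment would tailor to its needs.

```python
import json


def tenant_bucket_policy(bucket):
    """Return an IAM policy document scoped to a single tenant bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "ListTenantBucket",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reads and writes apply to the objects inside the bucket.
                "Sid": "ReadWriteTenantObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# Hypothetical bucket name for the finance tenant group.
policy = tenant_bucket_policy("datalake-finance")
print(json.dumps(policy, indent=2))
```

Attaching one such policy to each tenant group's role keeps the tenants' buckets isolated even though they live in the same account.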
On-Cloud Implementation Guideline (2/4)
Resource Group
The best approach is to create a separate resource group for each LOB tenant
Big Data Processing
Big data processing services provided by key cloud providers:
Cloud Provider | AWS | Azure | Google
Service Name | Elastic MapReduce (EMR), AWS Databricks | HDInsight, Azure Databricks | Google Cloud Dataproc
Data Processing Isolation
Create multiple services within the same subscription, one for each LOB, e.g. the operations department
has its own Azure / AWS Databricks service within a given subscription
Deploy multiple clusters, one for each tenant group or user, with appropriate identity and
access management
Store data processing results in the tenant-specific storage accounts / buckets
On-Cloud Implementation Guideline (3/4)
Security, Identity and Access
Cloud service providers are responsible for the security of the data centers
Application Security Implementation
Configure AWS / Azure Stack so that users from multiple IAM / Azure Active Directory
(Azure AD) tenants can use its services
Provide authorization using AWS Organizations / Azure RBAC
Encryption keys can be managed using AWS Key Management Service / Azure Key Vault
Security Features from Cloud Providers | AWS | Azure | Google
Authentication & Authorization | Identity and Access Management (IAM); AWS Organizations | Azure Active Directory; Azure RBAC | Google Cloud Identity and Access Management
Encryption | Server-side Encryption with Amazon S3 Key Management Service; Key Management Service | Azure Storage Service Encryption; Key Vault | Google / Customer Managed Encryption Key
On-Cloud Implementation Guideline (4/4)
Network Services
Multi-tenant Network Strategy
Isolate the application servers on their own physical network. This approach works for single
tenants on dedicated servers
Define Virtual Switches (vSwitches) for each tenant. vSwitches can bring all relevant virtual
machines together on one logical switch. Similar to virtual machines, vSwitches can move within
the cloud environment
Configure Virtual Local Area Networks (VLANs) and create a separate network for each tenant
Network Services from Cloud Providers | AWS | Azure | Google
Cloud Virtual Network | Virtual Private Cloud (VPC) | Virtual Network | Virtual Private Cloud (VPC)
Cross Premise Connectivity | AWS VPN Gateway | Azure VPN Gateway | Google VPN Gateway
Dedicated Network | Direct Connect | ExpressRoute | Dedicated Interconnect
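Creating a separate network for each tenant, as described in the network strategy above, usually starts with carving non-overlapping address ranges out of a shared block. The sketch below does this with Python's standard `ipaddress` module; the parent CIDR block, the subnet size and the tenant names are hypothetical.

```python
import ipaddress


def allocate_tenant_subnets(parent_cidr, tenants, new_prefix=24):
    """Assign each tenant its own non-overlapping subnet of the parent block."""
    parent = ipaddress.ip_network(parent_cidr)
    subnets = parent.subnets(new_prefix=new_prefix)
    allocation = {}
    for tenant in tenants:
        try:
            # Each call hands out the next unused subnet.
            allocation[tenant] = next(subnets)
        except StopIteration:
            raise ValueError("parent block has no free subnets left") from None
    return allocation


# Hypothetical address plan: one /24 per tenant inside 10.0.0.0/16.
allocation = allocate_tenant_subnets(
    "10.0.0.0/16", ["finance", "marketing", "operations"]
)
for tenant, subnet in allocation.items():
    print(f"{tenant}: {subnet}")
```

Each resulting range could then back one tenant's VLAN or virtual network, depending on the provider.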
On-Cloud Reference Architecture
[Figure: on-cloud reference architecture, described by layer]
Tenants: Customer A, Customer B, Customer C, Customer D
Data Access: SQL, Streaming, MLlib, GraphX, Impala, Drill, Presto, NoSQL, Solr
Data Management: Databricks Spark, Redshift, Azure SQL DWH, Elastic DWH, and HDInsight / EMR (Hive, HBase, Spark, Hadoop)
YARN: Data Operating System
IaaS: Load Balancer in front of VMs hosting Instance 1, Instance 2, ..., Instance n-1, Instance n
Cloud Data Storage (S3 / Azure Blob, ADLS / GCS)
Governance
Security & Operation
Thank You
Authors:
Sanjay Upadhyay
thoughtleaders@citiustech.com