This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information
contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
Data Lake – Multitenancy Best Practices
30 November, 2018 | Author: Sanjay Upadhyay; Sr. Solution Architect
CitiusTech Thought
Leadership
Objective
 Multitenancy in a data lake allows organizations to share their cluster resources across user
communities without impacting business SLAs and capabilities or security and privacy needs
 This document covers guidelines around achieving multitenancy in a data lake environment
 It covers the design and implementation guidelines needed for on-premise as well as cloud-based
multitenant data lakes, and highlights the reference architecture for both deployment options
Agenda
 Introduction
 Key Drivers for Data Lake Multitenancy
 On-Premise Data Lake Multitenancy
• Design Considerations
• Implementation Guidelines
• Reference Architecture
 On-Cloud Data Lake Multitenancy
• Design Considerations
• Implementation Guidelines
• Reference Architecture
Introduction
 In modern data management infrastructure, a data lake is an important repository that holds a vast
amount of data in its native format until it is needed. An Enterprise Data Lake (EDL) can not only
store and share enterprise-wide information but is also capable of performing a variety of
enterprise workload activities like batch processing, stream processing, interactive SQL,
enterprise search and advanced analytics
 The adoption of data lakes is increasing and organizations are leveraging data lakes for numerous
use cases. As usage increases, operational challenges like resource clogging also increase,
resulting in failure of critical business processes and impacting service level agreements (SLAs)
 Earlier, most data lake implementations were on-premise. However, organizations today are
leveraging cloud technology to replace or expand their existing data lake implementations
 Data lake multi-tenancy is an important architectural paradigm that enables multiple business
users and processes to share a common set of resources, such as Apache Hadoop clusters.
 This includes setting up appropriate policies around resource provisioning and access while
meeting SLAs and security requirements for each tenant
 The guidelines for achieving multitenancy differ between on-premise and on-cloud deployments.
This document describes the key design principles and implementation guidelines for each
Key Drivers for Data Lake Multitenancy
 Resource and Cost Optimization: Having multiple business units share the same cluster resources
brings significant cost savings (across hardware and operational costs)
 Collaboration and Decision Making: Multi-tenancy helps achieve data sharing and collaboration
between different teams and integrate multiple data silos to get a unified view of data
 Operational Simplicity: Sharing a cluster among multiple users makes it much easier to
deploy and manage
 Wide Audience: A multi-tenant, cloud-based cluster enables a wide range of stakeholders
(developers, analysts, data scientists) from different organizational units to access and use the
data
 Seamless Scalability: A multi-tenant model makes it easy for IT teams to scale out clusters to
meet changing business needs (e.g., business growth, new markets, reorganizations, mergers and
acquisitions)
On-Premise Design Considerations
Key design considerations to achieve data lake multitenancy:
Defining the Tenant
Identification of the different business units that would access the
cluster. All business units should have defined use cases to
leverage the enterprise data lake
Selecting the Right Isolation Model
Strategic cluster design can be done using one of three key
architectural models: Share Nothing, Shared Management and
Shared Resources
Defining the Resource Usage Agreement
Ensure complete coverage of utilization and SLA requirements
across storage, compute, data governance, tracking, auditing and
on-boarding
On-Premise Design: Isolation Models
Model 1: Share Nothing
 Cluster management and data are segregated
 Does not leverage multitenancy, but can be used by IT
teams based on operational realities and governance
policies
Model 2: Shared Management
 Cluster management is shared, while data and resources
are separated for tenant groups
 Useful for high-priority clusters that can’t afford to risk any
performance issues or resource contentions
Model 3: Shared Resources
 Leverages the multitenancy benefits from consolidated
cluster management, shared data and resources
 This isolation model is recommended for development of
enterprise data lakes
[Diagram: in Model 1, Tenant A and Tenant B each have their own Cluster Manager; in Models 2 and 3, Tenant A and Tenant B sit under a single Cluster Manager]
On-Premise Design: Resource Usage Agreement
Key Components
Storage
 Every tenant group on the cluster should have access to its own section (namespace
/ directory)
 A dedicated directory should be assigned to users in a tenant group to store
data
 The storage quota needs to be controlled to ensure work of other groups and
users isn’t impacted due to influx of large data
Compute
 All tenant groups should be guaranteed an agreed upon minimum compute
power at all times
 Recommended to provide compute power at tenant group level irrespective of
the tenant users
Data Governance
 Define metadata management and data lineage
Tracking & Auditing
 Track cluster access and generate reports
 Audit the data assets accessed, along with metadata such as access time, IP address, etc.
On-boarding
 Design the hierarchy of the tenant groups and service accounts
 The process of adding tenant groups and users to the cluster should be
straightforward
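The key components above could be recorded per tenant group and checked before on-boarding. The sketch below is illustrative only: the class `TenantAgreement`, its field names and the `validate` helper are assumptions introduced here, not part of Hadoop or of the original deck.

```python
from dataclasses import dataclass, field


@dataclass
class TenantAgreement:
    """Hypothetical resource usage agreement for one tenant group."""
    name: str
    directory: str            # dedicated namespace / directory for the group
    storage_quota_tb: float   # storage ceiling agreed for the group
    min_compute_pct: float    # guaranteed minimum share of cluster compute
    service_accounts: list = field(default_factory=list)


def validate(agreements):
    """Return a list of problems found across all agreements."""
    problems = []
    # Guaranteed minimums cannot add up to more than the whole cluster.
    total = sum(a.min_compute_pct for a in agreements)
    if total > 100:
        problems.append(f"guaranteed compute adds up to {total}%, above 100%")
    # Each tenant group needs its own directory.
    dirs = [a.directory for a in agreements]
    if len(dirs) != len(set(dirs)):
        problems.append("two tenant groups share a directory")
    return problems
```

Running `validate` on every change to the agreement list is one way to keep on-boarding straightforward while catching over-committed compute guarantees early.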
On-Premise Implementation Guidelines (1/8)
Hadoop Distributed File System (HDFS) Resource Management
HDFS is the key storage unit in the data lake and is shared by all
the tenant groups, users and service accounts for processing
jobs. HDFS storage is broadly classified into three categories:
 LOB Space: Space allocated to a particular line of business (LOB),
such as Finance or Marketing
 User Space: Dedicated space for individual users for
development / experimentation
 Enterprise Space: This layer stores all datasets (raw or
processed) used by multiple business groups
[Section navigation: HDFS Resource Management (Storage Structure, HDFS Quota Management, HDFS Resource Isolation), SQL-On-Hadoop Data Management, Compute, Security, Governance]
On-Premise Implementation Guidelines (2/8)
Storage Structure
The structure of HDFS storage should be simple and clearly
isolate the different business units’ raw data.
e.g. Storage Structure
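The example figure from the original slide is not recoverable. As a stand-in, the sketch below builds one possible directory layout from the three storage categories described earlier (LOB space, user space, enterprise space). The root `/data`, the `/user` prefix and the sub-directory names are assumptions chosen for illustration.

```python
from posixpath import join


def storage_layout(lobs, users, root="/data"):
    """Return a hypothetical HDFS directory layout, one path per line."""
    # Enterprise space: datasets (raw or processed) used by multiple groups.
    paths = [
        join(root, "enterprise", "raw"),
        join(root, "enterprise", "processed"),
    ]
    # LOB space: one directory per line of business.
    paths += [join(root, "lob", lob) for lob in lobs]
    # User space: one directory per individual user.
    paths += [join("/user", user) for user in users]
    return paths


for path in storage_layout(["finance", "marketing"], ["alice"]):
    print(path)
```

Keeping each business unit under its own top-level directory makes it simple to attach quotas and permissions at a single path per tenant.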
On-Premise Implementation Guidelines (3/8)
HDFS Quota Management
HDFS supports two quota mechanisms that administrators
can utilize to manage space usage by cluster tenants:
Disk Space Quotas
 Sets disk space limits on a per-directory basis
 Prevents users from accidentally or maliciously
consuming excess disk space within the cluster
Name Space Quotas
 Limits the number of files or subdirectories within a
particular directory
 Helps administrators optimize the metadata subsystem
(NameNode) within the Hadoop cluster
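Both quota types are set with the `hdfs dfsadmin` command (`-setSpaceQuota` for disk space, `-setQuota` for the number of names). The sketch below only builds the command strings for one tenant directory; the directory path and the quota values are assumptions.

```python
def quota_commands(directory, space_quota, max_names):
    """Build the dfsadmin commands that apply both quota types to a directory."""
    return [
        # Disk space quota, e.g. "10t" for 10 terabytes.
        f"hdfs dfsadmin -setSpaceQuota {space_quota} {directory}",
        # Name space quota: maximum number of files and directories.
        f"hdfs dfsadmin -setQuota {max_names} {directory}",
    ]


for command in quota_commands("/data/lob/finance", "10t", 1000000):
    print(command)
```

Note that the disk space quota counts every replica of a block, so a 10 TB quota with a replication factor of 3 holds roughly a third of that in user data.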
On-Premise Implementation Guidelines (4/8)
HDFS Resource Isolation
 HDFS resource isolation is achieved using HDFS federation (HDFS 2.X)
 HDFS federation in Hadoop uses multiple independent namenodes /
namespaces to horizontally scale the name service
 All data nodes are used as common storage for blocks by all the name
nodes. Each data node registers with all the name nodes in the cluster
Limitations of Single Namespace / Namenode
 Namespace and block storage are tightly coupled
 The namespace does not scale horizontally the way storage does: an HDFS cluster scales
out by adding data nodes, but the namespace remains on a single namenode
 File system operations are limited by the throughput of a single namenode, which
constrains overall Hadoop performance
 There is no separation of namespaces. This results in no isolation among
tenant organizations which are using the cluster
Benefits of HDFS Federation
 Isolation: A single namenode offers no isolation in a multi-user environment. With
federation, different categories of applications and users can be placed in separate
namespaces served by separate namenodes
 Scalability: Federation scales the file system namespace horizontally across
namenodes
 Performance: Read / Write throughput can be improved by adding more namenodes
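In a federated cluster, the namespaces are declared through `dfs.nameservices`, and each one gets its own `dfs.namenode.rpc-address.<nameservice>` entry in `hdfs-site.xml`. The sketch below generates those properties for one namespace per tenant; the nameservice IDs, host names and port 8020 are assumptions.

```python
def federation_properties(namenodes, port=8020):
    """Map {nameservice id: namenode host} to hdfs-site.xml properties."""
    # One comma-separated list names every namespace in the cluster.
    props = {"dfs.nameservices": ",".join(namenodes)}
    # Each namespace is served by its own independent namenode.
    for nameservice, host in namenodes.items():
        props[f"dfs.namenode.rpc-address.{nameservice}"] = f"{host}:{port}"
    return props


props = federation_properties(
    {"ns-finance": "nn1.example.com", "ns-marketing": "nn2.example.com"}
)
for key, value in props.items():
    print(f"{key} = {value}")
```

A real deployment would also configure HTTP addresses and, typically, a client-side mount table; those are left out to keep the sketch short.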
On-Premise Implementation Guidelines (5/8)
SQL-On-Hadoop Data Management
There are three ways to manage multi-tenant data in a SQL-on-Hadoop
database. The data is ultimately stored on HDFS but can be queried
using SQL tools like Hive, HAWQ, Impala, etc.
 Separated Databases: Storing tenant data in separate databases
is the simplest approach to data isolation
 Shared Database and Separate Schema: This approach involves
housing multiple tenants in the same database. Each tenant has
its own set of tables that are grouped into a schema created
specifically for that tenant
 Shared Database and Schema: The same database and same set
of tables host multiple tenants' data. Each table holds records from
multiple tenants, segregated by the value of a tenant ID column
[Diagram: spectrum from Isolated to Shared: Separated Databases, Separate Schema, Shared Schema]
Implementing Shared Database and Separate Schema is recommended.
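To make the trade-off concrete, the sketch below illustrates the third model (shared database and schema), where isolation rests entirely on a tenant ID filter. It uses Python's built-in `sqlite3` as a stand-in for a SQL-on-Hadoop engine; the table `claims`, its columns and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One shared table holds rows for every tenant, tagged by tenant_id.
conn.execute("CREATE TABLE claims (tenant_id TEXT, claim_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO claims VALUES (?, ?, ?)",
    [("finance", 1, 120.0), ("marketing", 2, 75.5), ("finance", 3, 40.0)],
)


def claims_for(tenant_id):
    """Return only the rows that belong to the given tenant."""
    rows = conn.execute(
        "SELECT claim_id, amount FROM claims WHERE tenant_id = ? ORDER BY claim_id",
        (tenant_id,),
    )
    return rows.fetchall()


print(claims_for("finance"))
```

Every query has to carry the tenant filter, so one missed predicate exposes another tenant's rows. Giving each tenant a separate schema moves that boundary into the catalog, where access can be granted per schema.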
On-Premise Implementation Guidelines (6/8)
Compute
 YARN Capacity Scheduler is used as a resource
management application to allocate shared cluster
resources among users and groups
 The queue is an important component of scheduling in
YARN and for isolating resources. Important queue
properties are:
• Queue name
• Queue path name
• Associated child queues and applications
• Minimum-maximum capacity of the queue
 Resources can be allocated with the Capacity Scheduler by:
• Enabling the Capacity Scheduler
• Setting up Queues
• Controlling access to queues with ACLs (Access Control Lists)
[Diagram: queue hierarchy with Root as the parent queue and Finance, Marketing and Operations as child queues]
e.g. Setting up a queue hierarchy
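The three steps (enable the scheduler, set up queues, control access with ACLs) map onto a handful of configuration properties. The sketch below generates them for the Finance / Marketing / Operations hierarchy; the capacity percentages and group names are assumptions, while the property names follow the Capacity Scheduler's `yarn.scheduler.capacity.*` convention.

```python
SCHEDULER_CLASS = (
    "org.apache.hadoop.yarn.server.resourcemanager."
    "scheduler.capacity.CapacityScheduler"
)


def capacity_scheduler_properties(queues):
    """queues maps queue name -> (capacity %, maximum capacity %, allowed groups)."""
    total = sum(capacity for capacity, _, _ in queues.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")
    prefix = "yarn.scheduler.capacity.root"
    props = {
        # Step 1: enable the Capacity Scheduler (set in yarn-site.xml).
        "yarn.resourcemanager.scheduler.class": SCHEDULER_CLASS,
        # Step 2: declare the child queues of root.
        f"{prefix}.queues": ",".join(queues),
    }
    for name, (capacity, max_capacity, groups) in queues.items():
        props[f"{prefix}.{name}.capacity"] = str(capacity)
        props[f"{prefix}.{name}.maximum-capacity"] = str(max_capacity)
        # Step 3: ACL format is "users groups"; a leading space means groups only.
        props[f"{prefix}.{name}.acl_submit_applications"] = " " + ",".join(groups)
    return props


props = capacity_scheduler_properties({
    "finance": (50, 70, ["fin_grp"]),
    "marketing": (30, 50, ["mkt_grp"]),
    "operations": (20, 40, ["ops_grp"]),
})
for key, value in props.items():
    print(f"{key} = {value}")
```

The minimum capacity is what each tenant is guaranteed, and the maximum caps how far a queue can grow into idle capacity. Queue ACLs are inherited from the parent, so the root queue's ACL usually needs to be restricted as well for the child ACLs to take effect.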
On-Premise Implementation Guidelines (7/8)
Security
Authentication
 Need to know who users are and which groups they belong to
 Kerberos is the only authentication method supported by
most components
Authorization
 Need to know what users can access and what level of
permissions they have
 In a multi-tenant environment, service accounts / IDs
need to be set up at the group and enterprise levels
 Service accounts exist in the enterprise identity store, and
are provided to end users using Apache Ranger / Sentry
Auditing
 Determines who did what and when
 Apache Ranger / Sentry or a custom solution can be used for auditing
Data Protection
 Data in Transit
• SSL / TLS needs to be enabled to encrypt data
between clients and service endpoints
• Keys / certificates are configured per service / role
 Data at Rest
• Multiple encryption zones on HDFS allow only
authorized users to access data
• Data is transmitted in an encrypted form, as encryption
is applied at the HDFS block level
• Keys can be stored in a Java keystore or HSM
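For data at rest, one encryption zone per tenant can be created with the `hadoop key` and `hdfs crypto` commands. The sketch below only assembles the command strings; the key naming convention `<tenant>-key` and the tenant path are assumptions.

```python
def encryption_zone_commands(tenant, path):
    """Build the commands that create a key and an encryption zone for a tenant."""
    key = f"{tenant}-key"  # hypothetical key naming convention
    return [
        # Create the zone key in the configured key provider.
        f"hadoop key create {key}",
        # The zone root has to exist and be empty before the zone is created.
        f"hdfs dfs -mkdir -p {path}",
        # Turn the directory into an encryption zone tied to the tenant's key.
        f"hdfs crypto -createZone -keyName {key} -path {path}",
    ]


for command in encryption_zone_commands("finance", "/data/lob/finance/secure"):
    print(command)
```

Using a separate key per tenant means access to one tenant's key does not unlock another tenant's zone.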
On-Premise Implementation Guidelines (8/8)
Governance
Apache Atlas can be used for managing the metadata of the data
assets and keeping track of lineage information.
On-Premise Reference Architecture
[Diagram: on-premise reference architecture. Tenants (Operations, Data Science, BI & DWH, Marketing) share one data lake with these layers:
 Data Access: Impala, Drill, Presto, NoSQL, Solr, SQL, Streaming, MLlib, GraphX, Spark, Hive, Pig, Cascading, Tez, Giraph, Mahout, MapReduce
 YARN: Data Operating System, with the YARN Capacity Scheduler
 Data Management: HDFS (Hadoop Distributed File System), with 1. HDFS Quota Management (Disk Space Quota, Name Space Quota) and 2. HDFS Federation
 Governance, and Security & Operation]
On-Cloud Design Considerations
Infrastructure Isolation (Physical & Logical)
 Infrastructure shared physically or logically based on security expectations (multi-tenant environment)
 Complete isolation of compute using a dedicated cluster or resource pool
 Logical isolation at the virtual machine level
 Complete network isolation through a dedicated network for every tenant, and VLANs for logical isolation
Virtualization Using Multiple VM Support
 Infrastructure components are virtualized and managed as a single entity, and are simultaneously
isolated based on tenancy
 Security and compliance needs are met through anti-collocation, hypervisor-level firewalls, resource
grouping of compute and storage, and VLAN-based isolation
Cloud Automation & Integration
 Essential to support client-specific business processes, identity management and integration with
tools and services
 In the multitenant cloud environment, it is essential to define standard processes and practices for
providing customization flexibility
Catalogue Manager
 Each aspect of multi-tenancy must culminate in an intuitive user interface
 Catalogue content depends on tenancy and individual privileges
 Multi-tenancy at a catalogue level may authorize integration with different directory services for
each tenant
On-Cloud Implementation Guideline (1/4)
Key guidelines to achieve a multitenant data lake environment:
Storage
Object stores are the key storage units in a cloud data lake. The storage offerings of the key cloud
providers are:
Cloud Provider | AWS | Azure | Google
Service Name | S3 | Azure Storage (Blob) | Google Cloud Storage
Hot | S3 Standard | Hot Blob Storage | GCS
Cool | S3 Standard - Infrequent Access | Cool Blob Storage | GCS Nearline
Archival | Glacier | Archive Blob Storage | GCS Coldline
Object Limit | Unlimited | Unlimited | Unlimited
Size Limit | 5 TB / Object | 500 TB / Account | 5 TB / Object
Storage Management
Multitenancy can be achieved using two methods:
 Storage Account: Isolation of storage can be implemented using a separate storage account for each
tenant group along with appropriate identity and access management
 Containers / Buckets: By creating separate containers / buckets for each tenant group with
appropriate identity and access management
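For the bucket-per-tenant approach on AWS, access is typically scoped with an IAM policy that names only that tenant's bucket. The sketch below builds such a policy document as a Python dictionary; the bucket name and the chosen set of actions are assumptions, and the result would still need to be attached to the tenant's role or group.

```python
import json


def tenant_bucket_policy(bucket):
    """Return an IAM policy document limited to a single tenant bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Reads and writes apply to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


print(json.dumps(tenant_bucket_policy("datalake-finance"), indent=2))
```

Because the policy names a single bucket, a tenant's credentials grant nothing on any other tenant's bucket unless another policy adds it.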
On-Cloud Implementation Guideline (2/4)
Resource Group
The best approach is to create a separate resource group for each LOB tenant
Big Data Processing
Big data processing services provided by key cloud providers:
Cloud Provider | AWS | Azure | Google
Service Name | Elastic MapReduce (EMR), AWS Databricks | HDInsight, Azure Databricks | Google Cloud Dataproc
Data Processing Isolation
 Create multiple services within the same subscription, one for each LOB, e.g. the operations
department has its own Azure / AWS Databricks service within a given subscription
 Deploy multiple clusters, one for each tenant group or user, with appropriate identity and
access management
 Store data processing results in the tenant-specific storage accounts / buckets
On-Cloud Implementation Guideline (3/4)
Security, Identity and Access
Cloud service providers are responsible for security of data centers
Application Security Implementation
 Configure AWS / Azure Stack to support users from multiple IAM / Azure Active Directory
(Azure AD) tenants to use services in AWS / Azure Stack
 Provide authorization using AWS Organizations / Azure RBAC
 Encryption keys can be managed using AWS Key Management Service / Azure Key Vault
Security Features from Cloud Providers | AWS | Azure | Google
Authentication & Authorization | Identity and Access Management (IAM) | Azure Active Directory | Google Cloud Identity and Access Management
Authentication & Authorization (continued) | AWS Organizations | Azure RBAC | -
Encryption | Server-side Encryption with Amazon S3 Key Management Service | Azure Storage Service Encryption | Google / Customer Managed Encryption Key
Encryption (continued) | Key Management Service | Key Vault | -
On-Cloud Implementation Guideline (4/4)
Network Services
Multi-tenant Network Strategy
 Isolate the application servers on their own physical network. This approach works for single
tenants on dedicated servers
 Define Virtual Switches (vSwitches) for each tenant. vSwitches can bring all relevant virtual
machines together on one logical switch. Similar to virtual machines, vSwitches can move within
the cloud environment
 Configure Virtual Local Area Networks (VLANs) and create a separate network for each tenant
Network Services from Cloud Providers | AWS | Azure | Google
Cloud Virtual Network | Virtual Private Cloud (VPC) | Virtual Network | Virtual Private Cloud (VPC)
Cross Premise Connectivity | AWS VPN Gateway | Azure VPN Gateway | Google VPN Gateway
Dedicated Network | Direct Connect | ExpressRoute | Dedicated Interconnect
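Creating a separate network per tenant usually starts with carving non-overlapping address ranges out of the virtual network. The sketch below does this with Python's standard `ipaddress` module; the 10.0.0.0/16 range, the /24 subnet size and the tenant names are assumptions.

```python
import ipaddress
from itertools import islice


def tenant_subnets(network_cidr, tenants, new_prefix=24):
    """Assign each tenant its own non-overlapping subnet of the given network."""
    network = ipaddress.ip_network(network_cidr)
    # Take only as many subnets as there are tenants.
    subnets = list(islice(network.subnets(new_prefix=new_prefix), len(tenants)))
    if len(subnets) < len(tenants):
        raise ValueError("network is too small for the number of tenants")
    return {tenant: str(subnet) for tenant, subnet in zip(tenants, subnets)}


print(tenant_subnets("10.0.0.0/16", ["finance", "marketing", "operations"]))
```

Each resulting range can then back a tenant's VLAN, subnet or virtual network, with routing and firewall rules deciding which ranges may talk to each other.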
On-Cloud Reference Architecture
[Diagram: on-cloud reference architecture serving Customer A, Customer B, Customer C and Customer D. Layers shown:
 Data Access: SQL, Streaming, MLlib, GraphX, Databricks, Spark, Redshift, Azure SQL DWH, Elastic DWH, Hive, HBase, Hadoop, HDInsight / EMR, Impala, Drill, Presto, NoSQL, Solr
 YARN: Data Operating System
 Data Management: Cloud Data Storage (S3 / Azure Blob, ADLS / GCS)
 Governance, and Security & Operation
 IaaS: Load Balancer, VMs, Instance 1 to Instance n]
Key Words
 Enterprise Data Lake
 Multitenancy
 HDFS Resource Management
 HDFS Resource Isolation
 HDFS Quota Management
Thank You
Authors:
Sanjay Upadhyay
thoughtleaders@citiustech.com
About CitiusTech
3,200+
healthcare IT professionals worldwide
100%
healthcare industry focus
30%+
CAGR over last 5 years
110+
healthcare customers
• Healthcare technology companies
• Hospitals, IDNs & medical groups
• Payers and health plans
• ACOs, MCOs, HIEs, HIXs, NHINs
• Pharma & Life Sciences companies

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 

Data Lake - Multitenancy Best Practices

  • 1. Data Lake – Multitenancy Best Practices
      30 November, 2018 | Author: Sanjay Upadhyay; Sr. Solution Architect
      CitiusTech Thought Leadership
      This document is confidential and contains proprietary information, including trade secrets of CitiusTech. Neither the document nor any of the information contained in it may be reproduced or disclosed to any unauthorized person under any circumstances without the express written permission of CitiusTech.
  • 2. Objective
      • Multitenancy in a data lake allows organizations to share their cluster resources across user communities without impacting business SLAs and capabilities or security and privacy needs
      • This document covers guidelines for achieving multitenancy in a data lake environment
      • It describes the design and implementation guidelines for on-premise as well as cloud-based multitenant data lakes, and presents the reference architecture for both deployment options
  • 3. Agenda
      • Introduction
      • Key Drivers for Data Lake Multitenancy
      • On-Premise Data Lake Multitenancy
          - Design Considerations
          - Implementation Guidelines
          - Reference Architecture
      • On-Cloud Data Lake Multitenancy
          - Design Considerations
          - Implementation Guidelines
          - Reference Architecture
  • 4. Introduction
      • In modern data management infrastructure, a data lake is an important repository that holds a vast amount of data in its native format until it is needed. An Enterprise Data Lake (EDL) can not only store and share enterprise-wide information but can also run a variety of enterprise workloads such as batch processing, stream processing, interactive SQL, enterprise search and advanced analytics
      • The adoption of data lakes is increasing, and organizations are leveraging them for numerous use cases. As usage increases, operational challenges such as resource clogging also increase, causing critical business processes to fail and impacting service level agreements (SLAs)
      • Earlier, most data lake implementations were on-premise. However, organizations today are leveraging cloud technology to replace or expand their existing data lake implementations
      • Data lake multi-tenancy is an important architectural paradigm that enables multiple business users and processes to share a common set of resources, such as Apache Hadoop clusters
      • This includes setting up appropriate policies around resource provisioning and access while meeting SLAs and security requirements for each tenant
      • The guidelines for achieving multitenancy differ between on-premise and on-cloud deployments. This document describes the key design principles and implementation guidelines for each
  • 5. Key Drivers for Data Lake Multitenancy
      • Resource and Cost Optimization: Having multiple business units share the same cluster resources brings significant cost savings (across hardware and operational costs)
      • Collaboration and Decision Making: Multi-tenancy helps teams share data and collaborate, and integrates multiple data silos into a unified view of data
      • Operational Simplicity: Sharing a cluster among multiple users makes it considerably easier to deploy and manage
      • Wide Audience: A multi-tenant, cloud-based cluster enables a wide range of stakeholders (developers, analysts, data scientists) from different organizational units to access and use the data
      • Seamless Scalability: A multi-tenant model makes it easy for IT teams to scale out clusters to meet changing business needs (e.g., business growth, new markets, reorganizations, mergers and acquisitions)
  • 6. On-Premise Design Considerations
      Key design considerations to achieve data lake multitenancy:
      • Defining the Tenant: Identify the different business units that will access the cluster. Every business unit should have defined use cases for the enterprise data lake
      • Selecting the Right Isolation Model: Design the cluster around one of three key architectural models: Share Nothing, Shared Management or Shared Resources
      • Defining the Resource Usage Agreement: Ensure complete coverage of utilization and SLA requirements across storage, compute, data governance, tracking, auditing and on-boarding
  • 7. On-Premise Design: Isolation Models
      • Model 1: Share Nothing
          - Cluster management and data are segregated
          - Does not leverage multitenancy, but can be used by IT teams based on operational realities and governance policies
      • Model 2: Shared Management
          - Cluster management is shared, while data and resources are separated for tenant groups
          - Useful for high-priority clusters that can't afford to risk any performance issues or resource contention
      • Model 3: Shared Resources
          - Leverages the multitenancy benefits of consolidated cluster management, shared data and shared resources
          - This isolation model is recommended for the development of enterprise data lakes
      [Diagram: placement of the cluster manager and Tenants A and B under each of the three models]
  • 8. On-Premise Design: Resource Usage Agreement
      Key Components
      • Storage
          - Every tenant group on the cluster should have access to its own section (namespace / directory)
          - A dedicated directory should be assigned to users in a tenant group to store data
          - The storage quota needs to be controlled so that the work of other groups and users isn't impacted by an influx of large data
      • Compute
          - All tenant groups should be guaranteed an agreed-upon minimum compute power at all times
          - It is recommended to provide compute power at the tenant group level, irrespective of the tenant users
      • Data Governance
          - Define metadata management and data lineage
      • Tracking & Auditing
          - Track cluster access and generate reports
          - Audit each data asset accessed, along with other metadata such as access time, IP address, etc.
      • On-boarding
          - Design the hierarchy of the tenant groups and service accounts
          - The process of adding tenant groups and users to the cluster should be straightforward
  • 9. On-Premise Implementation Guidelines (1/8)
      Topics covered in the implementation guidelines: HDFS Resource Management (Storage Structure, HDFS Quota Management, HDFS Resource Isolation), SQL-On-Hadoop Data Management, Compute, Security, Governance
      Hadoop Distributed File System (HDFS) Resource Management
      HDFS is the key storage unit in a data lake and is shared by all the tenant groups, users and service accounts for processing jobs. HDFS storage is broadly classified into three categories:
      • LOB Space: Space allocated to a particular line of business such as Finance, Marketing, etc.
      • User Space: Dedicated space for individual users for development / experimentation
      • Enterprise Space: This layer stores all datasets (raw or processed) used by multiple business groups
  • 10. On-Premise Implementation Guidelines (2/8)
      Storage Structure
      The structure of HDFS storage should be simple and clearly isolate the different business units' raw data.
      [Figure: example storage structure]
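The three storage categories described earlier (LOB space, user space, enterprise space) can be sketched as a small path-building helper. This is only an illustration: the root directories `/data/lob`, `/user` and `/data/enterprise`, the `raw` / `processed` subdirectories and the function names are assumptions, not a layout prescribed by the slides.

```python
from pathlib import PurePosixPath

# Hypothetical root directories for the three storage categories.
ROOTS = {
    "lob": PurePosixPath("/data/lob"),
    "user": PurePosixPath("/user"),
    "enterprise": PurePosixPath("/data/enterprise"),
}


def lob_paths(lob: str) -> list[str]:
    """Return the raw and processed directories for one line of business."""
    base = ROOTS["lob"] / lob.lower()
    return [str(base / "raw"), str(base / "processed")]


def user_path(username: str) -> str:
    """Return the personal sandbox directory for one user."""
    return str(ROOTS["user"] / username)


def enterprise_path(dataset: str) -> str:
    """Return the shared directory for a dataset used by several groups."""
    return str(ROOTS["enterprise"] / dataset)


if __name__ == "__main__":
    for lob in ("Finance", "Marketing"):
        for path in lob_paths(lob):
            print(path)
    print(user_path("jdoe"))
    print(enterprise_path("claims"))
```

Keeping each business unit under its own subtree makes it straightforward to attach quotas and permissions to a single directory per tenant.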
  • 11. On-Premise Implementation Guidelines (3/8)
      HDFS Quota Management
      HDFS supports two quota mechanisms that administrators can use to manage space usage by cluster tenants:
      • Disk Space Quotas
          - Set disk space limits on a per-directory basis
          - Prevent users from accidentally or maliciously consuming excess disk space within the cluster
      • Name Space Quotas
          - Limit the number of files or subdirectories within a particular directory
          - Help administrators optimize the metadata subsystem (NameNode) within the Hadoop cluster
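As a rough sketch of how the two quota types map to administrator commands, the helper below builds the `hdfs dfsadmin -setQuota` (name quota) and `hdfs dfsadmin -setSpaceQuota` (space quota) command lines for one tenant directory. The directory and the limits are hypothetical values chosen for illustration; the function only returns the command strings and does not run them.

```python
def quota_commands(directory: str, max_names: int, max_space: str) -> list[str]:
    """Build the HDFS admin commands that set both quota types on a directory.

    max_names: name quota (limit on files plus subdirectories).
    max_space: space quota, e.g. "10t" or "500g" (binary-prefix suffixes).
    """
    if max_names <= 0:
        raise ValueError("name quota must be a positive integer")
    return [
        f"hdfs dfsadmin -setQuota {max_names} {directory}",
        f"hdfs dfsadmin -setSpaceQuota {max_space} {directory}",
    ]


if __name__ == "__main__":
    # Hypothetical tenant directory and limits.
    for command in quota_commands("/data/lob/finance", 1_000_000, "10t"):
        print(command)
```

Note that the space quota counts every replica of a block, so with a replication factor of 3 a 10 TB quota holds roughly 3.3 TB of user data.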
  • 12. On-Premise Implementation Guidelines (4/8)
      HDFS Resource Isolation
      • HDFS resource isolation is achieved using HDFS federation (HDFS 2.X)
      • HDFS federation uses multiple independent namenodes / namespaces to scale the name service horizontally
      • All data nodes are used as common block storage by all the namenodes. Each data node registers with all the namenodes in the cluster
      Limitations of a Single Namespace / Namenode
      • Namespace and block storage are tightly coupled
      • The namespace does not scale the way data nodes do. Storage in an HDFS cluster scales horizontally by adding more data nodes, but the namespace stays on a single namenode
      • Hadoop's performance depends on the throughput of the namenode. All file system operations are bound by the throughput of that single namenode
      • There is no separation of namespaces, so tenant organizations using the cluster are not isolated from each other
      Benefits of HDFS Federation
      • Isolation: A single namenode offers no isolation in a multi-user environment. With federation, different categories of applications and users can be placed in separate namespaces served by separate namenodes
      • Scalability: The file system namespace scales horizontally as namenodes are added
      • Performance: Read / write throughput can be improved by adding more namenodes
  • 13. On-Premise Implementation Guidelines (5/8)
      SQL-On-Hadoop Data Management
      There are three different ways to manage multi-tenant data in a SQL-on-Hadoop database. The data is ultimately stored on HDFS but can be queried using SQL tools such as Hive, HAWQ, Impala, etc.
      • Separated Databases: Storing tenant data in separate databases is the simplest approach to data isolation
      • Shared Database and Separate Schema: This approach houses multiple tenants in the same database. Each tenant has its own set of tables, grouped into a schema created specifically for that tenant
      • Shared Database and Schema: The same database and the same set of tables host multiple tenants' data. Each table holds records from multiple tenants, segregated by the value of a tenant ID column
      [Diagram: spectrum from isolated to shared: Separated Databases, Separate Schema, Shared Schema]
      Implementing Shared Database and Separate Schema is recommended.
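The recommended "shared database, separate schema" model can be illustrated with a small runnable sketch. It uses Python's built-in `sqlite3` module purely as a stand-in for a SQL-on-Hadoop engine (the slides name Hive, HAWQ and Impala, which have their own DDL); each tenant gets its own schema via `ATTACH DATABASE`, and the `claims` table and tenant names are hypothetical.

```python
import sqlite3


def build_tenant_schemas(tenants: list[str]) -> sqlite3.Connection:
    """Create one schema per tenant inside a single shared connection.

    Tenant names are interpolated into SQL, so they must be trusted
    identifiers (validated here as alphanumeric plus underscore).
    """
    conn = sqlite3.connect(":memory:")
    for tenant in tenants:
        if not tenant.replace("_", "").isalnum():
            raise ValueError(f"unsafe schema name: {tenant!r}")
        # Each attached in-memory database acts as a separate schema.
        conn.execute(f"ATTACH DATABASE ':memory:' AS {tenant}")
        conn.execute(
            f"CREATE TABLE {tenant}.claims (id INTEGER PRIMARY KEY, amount REAL)"
        )
    return conn


if __name__ == "__main__":
    conn = build_tenant_schemas(["finance", "marketing"])
    conn.execute("INSERT INTO finance.claims (amount) VALUES (120.5)")
    # The same table name exists in both schemas, but the rows stay separate.
    print(conn.execute("SELECT COUNT(*) FROM finance.claims").fetchone()[0])
    print(conn.execute("SELECT COUNT(*) FROM marketing.claims").fetchone()[0])
```

Because every tenant has an identical table set under its own schema name, access can be granted per schema rather than filtered per row, which is the main difference from the shared-schema model with a tenant ID column.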
  • 14. On-Premise Implementation Guidelines (6/8)
      Compute
      • The YARN Capacity Scheduler is used as the resource management application to allocate shared cluster resources among users and groups
      • The queue is an important component of scheduling in YARN and of resource isolation. Important queue properties are:
          - Queue name
          - Queue path name
          - Associated child queues and applications
          - Minimum and maximum capacity of the queue
      • Resources can be allocated with the Capacity Scheduler by:
          - Enabling the Capacity Scheduler
          - Setting up queues
          - Controlling access to queues with ACLs (Access Control Lists)
      Example queue hierarchy: Root, with child queues Finance, Marketing and Operations
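The queue setup steps above can be sketched as a helper that produces Capacity Scheduler properties for the Finance / Marketing / Operations hierarchy. The property names (`yarn.scheduler.capacity.root.queues`, `.capacity`, `.maximum-capacity`, `.acl_submit_applications`) are standard Capacity Scheduler settings; the percentages and group names are hypothetical, and the helper only builds a dictionary rather than writing `capacity-scheduler.xml`.

```python
def capacity_scheduler_properties(
    queues: dict[str, tuple[int, int, str]],
) -> dict[str, str]:
    """Build Capacity Scheduler properties for child queues of root.

    queues maps queue name -> (guaranteed capacity %, maximum capacity %,
    submit ACL in the form "user1,user2 group1,group2").
    """
    total = sum(capacity for capacity, _, _ in queues.values())
    if total != 100:
        # Sibling queue capacities must add up to 100 percent.
        raise ValueError(f"queue capacities sum to {total}, expected 100")

    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for name, (capacity, max_capacity, acl) in queues.items():
        prefix = f"yarn.scheduler.capacity.root.{name}"
        props[f"{prefix}.capacity"] = str(capacity)
        props[f"{prefix}.maximum-capacity"] = str(max_capacity)
        props[f"{prefix}.acl_submit_applications"] = acl
    return props


if __name__ == "__main__":
    # Hypothetical tenants; a leading space in the ACL means "groups only".
    example = capacity_scheduler_properties(
        {
            "finance": (40, 60, " finance_grp"),
            "marketing": (30, 50, " marketing_grp"),
            "operations": (30, 50, " operations_grp"),
        }
    )
    for key, value in example.items():
        print(f"{key}={value}")
```

The guaranteed capacity gives each tenant group its agreed minimum, while the maximum capacity lets a queue borrow idle resources up to a cap.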
  • 15. On-Premise Implementation Guidelines (7/8)
      Security
      • Authentication
          - Need to know who the users are and which groups they belong to
          - Kerberos is the only authentication method supported by most components
      • Authorization
          - Need to know what users can access and what level of permissions they have
          - In a multi-tenant environment, service accounts / IDs need to be set up at the group and enterprise levels
          - Service accounts exist in the enterprise identity store, and are provided to end users using Apache Ranger / Sentry
      • Auditing
          - Determines who did what and when
          - Apache Ranger / Sentry or a custom solution can be used for auditing
      • Data Protection
          - Data in Transit: SSL / TLS needs to be enabled to encrypt data between clients and service endpoints, with keys / certificates configured per service / role
          - Data at Rest: Multiple encryption zones on HDFS allow only authorized users to access data. Data is transmitted in an encrypted form as encryption is on HDFS block level. Keys can be stored in a Java keystore or an HSM
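For the data-at-rest point, a per-tenant encryption zone is typically created with the `hadoop key create` and `hdfs crypto -createZone` commands. The sketch below only assembles those command lines; the tenant name, key naming scheme, key size and path are hypothetical, and a configured key provider (KMS) is assumed to exist.

```python
def encryption_zone_commands(tenant: str, path: str, key_bits: int = 256) -> list[str]:
    """Build the commands that create a key and an encryption zone for a tenant.

    The zone directory must exist and be empty when the zone is created.
    """
    key_name = f"{tenant}_key"  # hypothetical naming convention
    return [
        f"hadoop key create {key_name} -size {key_bits}",
        f"hdfs dfs -mkdir -p {path}",
        f"hdfs crypto -createZone -keyName {key_name} -path {path}",
    ]


if __name__ == "__main__":
    for command in encryption_zone_commands("finance", "/data/lob/finance/secure"):
        print(command)
```

Using one key per tenant zone means access to a tenant's data can be revoked by restricting that key alone.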
  • 16. On-Premise Implementation Guidelines (8/8)
      Governance
      Apache Atlas can be used for managing the metadata of the data assets and keeping track of lineage information.
  • 17. On-Premise Reference Architecture
      [Diagram: on-premise reference architecture. Layers and components shown:]
      • Tenants: Data Lake Operations, Data Science, BI & DWH, Marketing
      • Data Access: Impala, Drill, Presto, NoSQL, Solr; Spark (SQL, Streaming, MLlib, GraphX); Tez (Hive, Pig, Cascading); MapReduce (Hive, Pig, Giraph, Mahout)
      • YARN (Data Operating System): YARN Capacity Scheduler
      • HDFS (Hadoop Distributed File System): 1. HDFS Quota Management (Disk Space Quota, Name Space Quota); 2. HDFS Federation
      • Cross-cutting: Governance, Security & Operation, Data Management
  • 18. On-Cloud Design Considerations
      • Infrastructure Isolation (Physical & Logical)
          - Infrastructure is shared physically or logically based on security expectations (multi-tenant environment)
          - Complete isolation of compute using a dedicated cluster or resource pool
          - Logical isolation at the virtual machine level
          - Complete network isolation through a dedicated network for every tenant, and VLANs for logical isolation
      • Virtualization Using Multiple VM Support
          - Infrastructure components are virtualized and managed as a single entity, and are simultaneously isolated based on tenancy
          - Security and compliance needs are met through anti-collocation, hypervisor-level firewalls, resource grouping of compute and storage, and VLAN-based isolation
      • Cloud Automation & Integration
          - Essential to support client-specific business processes, identity management and integration with tools and services
          - In a multitenant cloud environment, it is essential to define standard processes and practices for providing customization flexibility
      • Catalogue Manager
          - Each aspect of multi-tenancy must culminate in an intuitive user interface
          - Catalogue content depends on tenancy and individual privileges
          - Multi-tenancy at the catalogue level may authorize integration with different directory services for each tenant
  • 19. On-Cloud Implementation Guideline (1/4)
      Key guidelines to achieve a multitenant data lake environment:
      Storage
      Object stores are the key storage units in a cloud data lake. The offerings of the key cloud providers are:
      Cloud Provider | AWS                             | Azure                | Google
      Service Name   | S3                              | Azure Storage (Blob) | Google Cloud Storage
      Hot            | S3 Standard                     | Hot Blob Storage     | GCS
      Cool           | S3 Standard - Infrequent Access | Cool Blob Storage    | GCS Nearline
      Archival       | Glacier                         | Archive Blob Storage | GCS Coldline
      Object Limit   | Unlimited                       | Unlimited            | Unlimited
      Size Limit     | 5 TB / Object                   | 500 TB / Account     | 5 TB / Object
      Storage Management
      Multitenancy can be achieved using two methods:
      • Storage Account: Storage is isolated by using a separate storage account for each tenant group, along with appropriate identity and access management
      • Containers / Buckets: Separate containers / buckets are created for each tenant group, with appropriate identity and access management
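The bucket-per-tenant method can be sketched for AWS S3 as an identity-based IAM policy document that limits a tenant group to its own bucket. The bucket naming scheme (`datalake-<tenant>`) and the chosen set of actions are assumptions for illustration; the helper only builds the JSON document and does not call any AWS API.

```python
import json


def tenant_bucket_name(tenant: str, prefix: str = "datalake") -> str:
    """Derive a per-tenant bucket name (hypothetical naming convention)."""
    return f"{prefix}-{tenant.lower()}"


def tenant_bucket_policy(bucket: str) -> dict:
    """Build an IAM policy allowing list, read and write on one bucket only."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
        ],
    }


if __name__ == "__main__":
    bucket = tenant_bucket_name("Finance")
    print(json.dumps(tenant_bucket_policy(bucket), indent=2))
```

Attaching one such policy to each tenant group's role keeps tenants from listing or reading each other's buckets while they share the same account.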
  • 20. On-Cloud Implementation Guideline (2/4)
      Resource Group
      The best approach is to create a separate resource group for each LOB tenant.
      Big Data Processing
      Big data processing services provided by the key cloud providers:
      Cloud Provider | AWS                                      | Azure                        | Google
      Service Name   | Elastic MapReduce (EMR), AWS Databricks  | HDInsight, Azure Databricks  | Google Cloud Dataproc
      Data Processing Isolation
      • Create multiple services within the same subscription, one for each LOB, e.g. the operations department has its own Azure / AWS Databricks service within a given subscription
      • Deploy multiple clusters, one for each tenant group or user, with appropriate identity and access management
      • Data processing results can be stored in the tenant-specific storage accounts / buckets
  • 21. On-Cloud Implementation Guideline (3/4)
      Security, Identity and Access
      Cloud service providers are responsible for the security of their data centers.
      Application Security Implementation
      • Configure AWS / Azure Stack to support users from multiple IAM / Azure Active Directory (Azure AD) tenants to use services in AWS / Azure Stack
      • Provide authorization using AWS Organization / Azure RBAC
      • Data encryption can be secured using AWS Key Management Service / Azure Key Vault
      Security Features from Cloud Providers
      Feature                        | AWS                                                          | Azure                            | Google
      Authentication & Authorization | Identity and Access Management (IAM)                         | Azure Active Directory           | Google Cloud Identity and Access Management
                                     | AWS Organization                                             | Azure RBAC                       | -
      Encryption                     | Server-side Encryption with Amazon S3 Key Management Service | Azure Storage Service Encryption | Google / Customer Managed Encryption Key
                                     | Key Management Service                                       | Key Vault                        | -
  • 22. On-Cloud Implementation Guideline (4/4)
      Network Services
      Multi-tenant Network Strategy
      • Isolate the application servers on their own physical network. This approach works for single tenants on dedicated servers
      • Define Virtual Switches (vSwitches) for each tenant. vSwitches can bring all relevant virtual machines together on one logical switch. Similar to virtual machines, vSwitches can move within the cloud environment
      • Configure Virtual Local Area Networks (VLANs) and create a separate network for each tenant
      Network Services from Cloud Providers
      Service                    | AWS                         | Azure             | Google Cloud
      Virtual Network            | Virtual Private Cloud (VPC) | Virtual Network   | Virtual Private Cloud (VPC)
      Cross Premise Connectivity | AWS VPN Gateway             | Azure VPN Gateway | Google VPN Gateway
      Dedicated Network          | Direct Connect              | Express Route     | Dedicated Interconnect
  • 23. On-Cloud Reference Architecture
      [Diagram: on-cloud reference architecture. Layers and components shown:]
      • Tenants: Customer A, Customer B, Customer C, Customer D
      • Entry point: Load Balancer in front of VM instances (Instance 1, Instance 2, ... Instance n-1, Instance n)
      • Data Access: Databricks Spark (SQL, Streaming, MLlib, GraphX); Elastic DWH (Redshift, Azure SQL DWH); HDInsight / EMR (Hive, HBase, Spark, Hadoop); IaaS (Impala, Drill, Presto, NoSQL, Solr)
      • YARN: Data Operating System
      • Cloud Data Storage (S3 / Azure Blob, ADLS / GCS)
      • Cross-cutting: Governance, Security & Operation, Data Management
  • 24. References
      • http://archive.gtra.org/files/Multitenancy_and_the_Enterprise_Data_Hub.pdf
      • http://thesai.org/Downloads/Volume5No11/Paper_23-A_Hybrid_Multi-Tenant_Database_Schema_for_Multi-Level_Quality_of_Service.pdf
      • https://www.linkedin.com/pulse/multi-tenancy-deployment-options-big-data-design-pattern-tom-martin/
      • https://www.ibm.com/blogs/cloud-computing/2016/08/16/design-considerations-multi-tenant-cloud/
      • https://www.networkworld.com/article/3191520/cloud-computing/deep-dive-on-aws-vs-azure-vs-google-cloud-storage-options.html
      • https://securosis.com/assets/library/reports/Securing_Hadoop_Final_V2.pdf
      • https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_yarn-resource-management/bk_yarn-resource-management.pdf
      • http://www.jamesserra.com/archive/2016/07/multi-tenant-databases-in-the-cloud/
      • https://blogs.technet.microsoft.com/yungchou/2013/08/08/resource-pooling-virtualization-fabric-and-cloud/
      • http://dataconomy.com/2017/11/building-governed-data-lake-cloud/
  • 25. Key Words
      • Enterprise Datalake
      • Multitenancy
      • HDFS Resource Management
      • HDFS Resource Isolation
      • HDFS Quota Management
  • 26. Thank You
      Authors: Sanjay Upadhyay
      thoughtleaders@citiustech.com
      About CitiusTech
      • 3,200+ healthcare IT professionals worldwide
      • 100% healthcare industry focus
      • 30%+ CAGR over last 5 years
      • 110+ healthcare customers
          - Healthcare technology companies
          - Hospitals, IDNs & medical groups
          - Payers and health plans
          - ACOs, MCOs, HIEs, HIXs, NHINs
          - Pharma & Life Sciences companies