Organizations are increasingly exploring lakehouse architectures with Databricks to combine the best of data lakes and data warehouses. Databricks SQL Analytics innovates on the “house” side to deliver data warehousing performance with the flexibility of data lakes. The lakehouse supports a diverse set of use cases and workloads, each with distinct data access considerations. On the lake side, tables with sensitive data require fine-grained access controls that are enforced across the raw data and the derivative data products created through feature engineering or transformations. On the house side, tables can require fine-grained data access, such as row-level segmentation for data sharing, along with additional transformations using analytics engineering tools. On the consumption side, there are additional considerations for managing access from popular BI tools such as Tableau, Power BI, or Looker.
The product team at Immuta, a Databricks partner, will share their experience building data access governance solutions for lakehouse architectures across different data lake and warehouse platforms, and show Databricks teams new to SQL Analytics how to set up data access for common scenarios.
2. Agenda
■ Introduction to Lakehouse Concepts for Governance
■ Role-Based Access Control (RBAC) vs. Attribute-Based Access Control (ABAC)
■ Enterprise-Grade Authorization in Databricks SQL Analytics
5. What is a Lakehouse?
■ Let’s do a (brief) history lesson
■ Late 1980s: the Data Warehouse
■ Early 2010s: the Data Lake
■ The roaring ’20s: the Data Lakehouse
6. Key Features of the Lakehouse
Basic Key Attributes
■ Transaction support
■ Schema enforcement and governance
■ BI support
■ Separate storage from compute
■ Support for diverse workloads
Enterprise-Grade Features
■ Scalable security and access control management
■ Additional data governance capabilities such as auditing and lineage
■ Data discovery tools such as data catalogs
10. Role-Based Access Control (RBAC)
To manage access to resources, group permissions into roles, and assign those roles to users
■ User-Role relationships
■ Role-Permission relationships
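The two relationships above can be sketched as two lookup tables. This is a minimal illustration, not how Databricks stores grants; the user names, role names, and resource names are hypothetical.

```python
# Hypothetical data: which roles each user holds (User-Role relationships).
USER_ROLES = {
    "alice": {"finance"},
    "bob": {"sales"},
}

# Hypothetical data: which (action, resource) pairs each role grants
# (Role-Permission relationships).
ROLE_PERMISSIONS = {
    "finance": {("SELECT", "accounting.profit_loss_statement")},
    "sales": {("SELECT", "sales.fct_sales")},
}


def is_allowed(user, action, resource):
    """Return True if any role held by the user grants (action, resource)."""
    roles = USER_ROLES.get(user, set())
    return any(
        (action, resource) in ROLE_PERMISSIONS.get(role, set())
        for role in roles
    )
```

Access is never attached to a user directly: changing what "finance" may read changes it for every member at once, which is the main operational appeal of RBAC.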
11. Role-Based Access Control (RBAC)
Define a User-Role relationship in Databricks SQL Analytics
■ Manage groups using the Admin Console, Groups API, or SCIM API
■ Add users to groups and remove them
12. Role-Based Access Control (RBAC)
Define a Role-Permission relationship in Databricks SQL Analytics
■ Define the access that a role grants to a user
■ At a high level this can be implemented in terms of the is_member() function
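To make the idea concrete, here is a small local emulation using SQLite from the Python standard library. It is only a sketch: on Databricks, is_member() is provided by the engine, whereas here a stand-in function with the same name is registered by hand, and the table contents are invented for the example.

```python
import sqlite3


def read_profit_loss(user_groups):
    """Return profit/loss rows only if the caller belongs to 'finance'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE profit_loss_statement (quarter TEXT, profit INTEGER)"
    )
    conn.executemany(
        "INSERT INTO profit_loss_statement VALUES (?, ?)",
        [("Q1", 120), ("Q2", -30)],  # made-up sample rows
    )
    # Stand-in for the engine's is_member(): 1 if the caller is in the group.
    conn.create_function(
        "is_member", 1, lambda group: int(group in user_groups)
    )
    rows = conn.execute(
        "SELECT quarter, profit FROM profit_loss_statement "
        "WHERE is_member('finance') ORDER BY quarter"
    ).fetchall()
    conn.close()
    return rows
```

The role-permission rule lives entirely in the predicate, so a caller outside the group gets an empty result rather than an error.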
14. Attribute-Based Access Control (ABAC)
Represent fine-grained or dynamic permissions based on who the user is and their relationship to the resource they want to access.
■ User relationship to the resource can be expressed as a JOIN on user attributes and values of a resource column
16. Access Control Dimensions
Table
A user can access sales data, but not financial data
Row
A user can access a particular sales opportunity, or a sales opportunity matching certain conditions
Column
A user can access only certain fields of a record, and we can mask the values of a column depending on the user trying to access it
17. Me, just now:
You’re going to need a framework to manage all of these access controls across your enterprise.
18. Requirements for Enterprise-Grade Access Controls Framework
Table-level policies
Individuals can be granted access to query tables and views by virtue of:
● membership in a group (role-based)
● possession of an attribute (attribute-based)
● request and approval by an admin
● public access
● individual user selection
● access for a specified period of time
● access only for a specific purpose
Row-level policies
Individuals can be allowed to see rows in a dataset based on:
● membership in a group whose name corresponds to a column value
● possession of an attribute whose value corresponds to a column value
● a filter on a time column, so users are entitled to query only rows that meet a specific recency requirement
Column-level policies
Different users see different values in specific columns by virtue of the roles, attributes, and purposes discussed above; examples include:
● Masking a column to NULL
● Masking a column using hashing
● Masking a column to a constant string
● Other advanced privacy-enhancing technologies (PETs) and differential privacy
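The first three column-level masking options can be sketched as plain functions. This is an illustrative sketch only: the function names, the salt, and the "REDACTED" placeholder are assumptions, and a real deployment would apply the mask inside the query engine rather than in application code.

```python
import hashlib


def mask_null(value):
    """Mask a column value to NULL (None in Python)."""
    return None


def mask_hash(value, salt="demo-salt"):
    """Mask with a salted SHA-256 digest; equal inputs give equal tokens."""
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()


def mask_constant(value, replacement="REDACTED"):
    """Mask a column value to a constant string."""
    return replacement


def apply_column_policy(row, column, mask, user_is_exempt):
    """Return a copy of row with one column masked unless the user is exempt."""
    masked = dict(row)
    if not user_is_exempt:
        masked[column] = mask(row[column])
    return masked
```

Hashing keeps masked values joinable and countable because equal inputs map to the same token, whereas NULL and constant masks remove that signal entirely.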
19. Framework for Managing Table-level Access Controls
RBAC
Users who are part of the Active Directory group called finance are allowed to read profit and loss data.
Provided we’ve kept our groups in sync between our corporate directory and Databricks, using the Admin Console, Groups API, or SCIM API, we can solve this requirement simply with:
GRANT SELECT ON TABLE
accounting.profit_loss_statement
TO finance;
ABAC
Users with the attribute executive are allowed to read sales data.
This one is a bit more complex. First, we need to store a (user, name, value) triple in some sort of attributes table.
Next, we’ll need to create a secure view on top of the original table, since we can’t pass a WHERE clause as a principal, only a user or group.
20. Solving for ABAC in our Framework
Users with the attribute executive are allowed to read sales data.
21. Solving for ABAC in our Framework
Restrict the user to only be able to view their own personal attributes.
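One way to picture this step is a query over the attributes table that is always filtered to the caller. The sketch below uses SQLite locally and passes the caller's identity as a parameter; on Databricks the filter would more likely sit inside a view that calls the built-in current_user(), with users granted access to the view only. The table name user_attributes, the column name user_name, and the sample users are assumptions.

```python
import sqlite3


def make_attributes_table(rows):
    """Build an in-memory table of (user_name, name, value) triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_attributes (user_name TEXT, name TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO user_attributes VALUES (?, ?, ?)", rows)
    return conn


def my_attributes(conn, current_user):
    """Return only the attributes that belong to the calling user."""
    return conn.execute(
        "SELECT name, value FROM user_attributes "
        "WHERE user_name = ? ORDER BY name, value",
        (current_user,),
    ).fetchall()


# Hypothetical sample data.
conn = make_attributes_table([
    ("zach@example.com", "executive", "true"),
    ("zach@example.com", "territory", "US-EAST"),
    ("maria@example.com", "territory", "EU"),
])
```

Users should not be able to read the underlying table directly, otherwise they could discover which attributes other people hold.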
22. Solving for ABAC in our Framework
Putting it all together. Users with the attribute executive are allowed to read sales data.
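A rough local sketch of the combined rule follows: sales rows are returned only when the attributes table holds an executive attribute for the caller. SQLite stands in for the SQL engine, the caller is passed as a parameter instead of coming from current_user(), and the table names fct_sales and user_attributes, the column user_name, and the "true" value are all assumptions made for illustration.

```python
import sqlite3


def read_sales(current_user, attributes):
    """Return all sales rows if the caller has the 'executive' attribute."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fct_sales (sale_id INTEGER, amount INTEGER)")
    conn.executemany(
        "INSERT INTO fct_sales VALUES (?, ?)",
        [(1, 1000000), (2, 150000)],
    )
    conn.execute(
        "CREATE TABLE user_attributes (user_name TEXT, name TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO user_attributes VALUES (?, ?, ?)", attributes)
    # The EXISTS predicate plays the role of the secure view's WHERE clause.
    rows = conn.execute(
        """
        SELECT s.sale_id, s.amount
        FROM fct_sales AS s
        WHERE EXISTS (
            SELECT 1 FROM user_attributes AS a
            WHERE a.user_name = ? AND a.name = 'executive'
        )
        ORDER BY s.sale_id
        """,
        (current_user,),
    ).fetchall()
    conn.close()
    return rows
```

Because the rule is data-driven, granting or revoking the executive attribute is a row change in the attributes table and needs no new GRANT statement.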
23. Managing Row-level Access Controls
A user can access a particular sales opportunity, or a sales opportunity matching certain conditions.
■ Let’s consider a sales dataset that has a territory column, and we only want users with the attribute territory to be able to see rows with the corresponding value in the territory column
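The territory rule can be sketched as a JOIN between the sales rows and the caller's territory attributes. The sample rows are the ones from the sec_fct_sales slide in this deck; everything else (SQLite as a local stand-in, the table names fct_sales and user_attributes, the column user_name, and the sample users) is assumed for illustration.

```python
import sqlite3

# Rows from the sec_fct_sales slide: (sale_id, amount, territory).
SALES = [
    (1, 1000000, "US-EAST"),
    (2, 150000, "US-EAST"),
    (3, 175000, "EU"),
    (4, 800000, "APAC"),
    (5, 50000, "US-WEST"),
    (6, 75000, "US-CENTRAL"),
    (7, 50000, "US-EAST"),
]


def visible_sales(current_user, attributes):
    """Return the sales rows whose territory matches a caller attribute."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fct_sales (sale_id INTEGER, amount INTEGER, territory TEXT)"
    )
    conn.executemany("INSERT INTO fct_sales VALUES (?, ?, ?)", SALES)
    conn.execute(
        "CREATE TABLE user_attributes (user_name TEXT, name TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO user_attributes VALUES (?, ?, ?)", attributes)
    rows = conn.execute(
        """
        SELECT DISTINCT s.sale_id, s.amount, s.territory
        FROM fct_sales AS s
        JOIN user_attributes AS a
          ON a.name = 'territory' AND a.value = s.territory
        WHERE a.user_name = ?
        ORDER BY s.sale_id
        """,
        (current_user,),
    ).fetchall()
    conn.close()
    return rows
```

A user holding several territory attributes simply matches more rows, and DISTINCT guards against duplicates if the same attribute is stored twice.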
29. sec_fct_sales
visible  sale_id  amount   territory
YES      1        1000000  US-EAST
YES      2        150000   US-EAST
NO       3        175000   EU
NO       4        800000   APAC
NO       5        50000    US-WEST
NO       6        75000    US-CENTRAL
YES      7        50000    US-EAST
30. sec_fct_sales (for user without the executive attribute)
visible  sale_id  amount  territory
YES      1        NULL    US-EAST
YES      2        NULL    US-EAST
NO       3        NULL    EU
NO       4        NULL    APAC
NO       5        NULL    US-WEST
NO       6        NULL    US-CENTRAL
YES      7        NULL    US-EAST
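The table above appears to combine a row filter on territory with a column mask on amount. The sketch below reproduces that behaviour locally under stated assumptions: SQLite stands in for the SQL engine, the caller is passed as a parameter, and the underlying names fct_sales, user_attributes, and user_name are hypothetical.

```python
import sqlite3

SALES = [
    (1, 1000000, "US-EAST"), (2, 150000, "US-EAST"), (3, 175000, "EU"),
    (4, 800000, "APAC"), (5, 50000, "US-WEST"), (6, 75000, "US-CENTRAL"),
    (7, 50000, "US-EAST"),
]


def secure_sales(current_user, attributes):
    """Filter rows by territory and mask amount for non-executives."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fct_sales (sale_id INTEGER, amount INTEGER, territory TEXT)"
    )
    conn.executemany("INSERT INTO fct_sales VALUES (?, ?, ?)", SALES)
    conn.execute(
        "CREATE TABLE user_attributes (user_name TEXT, name TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO user_attributes VALUES (?, ?, ?)", attributes)
    rows = conn.execute(
        """
        SELECT s.sale_id,
               -- Column-level policy: only executives see the amount.
               CASE WHEN EXISTS (
                        SELECT 1 FROM user_attributes AS e
                        WHERE e.user_name = :user AND e.name = 'executive')
                    THEN s.amount ELSE NULL END AS amount,
               s.territory
        FROM fct_sales AS s
        -- Row-level policy: territory must match a caller attribute.
        WHERE EXISTS (
            SELECT 1 FROM user_attributes AS t
            WHERE t.user_name = :user
              AND t.name = 'territory'
              AND t.value = s.territory)
        ORDER BY s.sale_id
        """,
        {"user": current_user},
    ).fetchall()
    conn.close()
    return rows
```

The rows marked NO in the table are filtered out entirely, and the rows marked YES come back with amount set to NULL unless the caller also holds the executive attribute.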
31. Thanks for coming to my talk. My name is Zachary and I’m a product manager at Immuta, which provides an enterprise-grade access controls platform to data teams just like this. AMA!
Thank You!