This article gives you practical input on how to get started with data governance in Databricks and points out directions for further improvement.
You’ll learn how to tackle the typical data governance challenges:
There are a few ways to manage Databricks access:
Access to Databricks is granted by entering an employee’s email address. All users have to be present in Azure Active Directory. This is a viable option for system users too.
Databricks groups are not connected to Azure Active Directory, so you have to define the groups and assign users yourself. Consider using the same names as your Azure Active Directory groups. These groups are crucial when setting up access to clusters, notebooks, etc.
Personal Access Tokens can be used to authenticate to the Databricks API instead of passwords.
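For instance, here is a minimal sketch of calling the Databricks REST API with a token kept in a secret scope (the workspace URL and the scope/key names are placeholders):

import requests

# Placeholder workspace URL; the token is read from a secret scope instead of being hard-coded
databricks_host = "https://adb-1234567890123456.7.azuredatabricks.net"
token = dbutils.secrets.get(scope = "api", key = "databricks-pat")

# List clusters using the PAT as a bearer token
response = requests.get(
    databricks_host + "/api/2.0/clusters/list",
    headers = {"Authorization": "Bearer " + token}
)
response.raise_for_status()
print(response.json())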
Setting up access to clusters, notebooks, and MLflow experiments is straightforward:
To ensure you grant the proper permissions, take a look at the detailed documentation.
Use secret scopes to keep passwords and secrets out of notebooks. The good news is that you can also use these secrets in cluster configurations.
jdbcUsername = dbutils.secrets.get(scope = "jdbc", key = "username")
jdbcPassword = dbutils.secrets.get(scope = "jdbc", key = "password")
There are two types of scopes: Azure Key Vault-backed scopes and Databricks-backed scopes. Azure Key Vault is often a better choice as it gives more control over your secrets. You can read more about them here.
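As a sketch of using secrets in cluster configurations, Databricks lets you reference a secret with the {{secrets/&lt;scope&gt;/&lt;key&gt;}} syntax instead of pasting plain-text values (the scope and key names below are placeholders):

# Fragment of a cluster definition that references secrets instead of plain-text values
# (scope and key names are placeholders)
cluster_spec = {
    "spark_env_vars": {
        "JDBC_PASSWORD": "{{secrets/jdbc/password}}"
    },
    "spark_conf": {
        "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net": "{{secrets/storage/account-key}}"
    }
}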
Be careful with who creates and manages clusters. A cluster policy limits the ability to configure clusters based on a set of rules.
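As an illustration, here is a minimal sketch of creating such a policy through the Cluster Policies API; the policy name, rules and the host/token variables reused from the snippet above are examples:

import json
import requests

# Example policy: pin the node type and cap auto-termination
policy_definition = {
    "node_type_id": {"type": "fixed", "value": "Standard_DS3_v2"},
    "autotermination_minutes": {"type": "range", "maxValue": 60, "defaultValue": 30}
}

# databricks_host and token as in the PAT snippet above
requests.post(
    databricks_host + "/api/2.0/policies/clusters/create",
    headers = {"Authorization": "Bearer " + token},
    json = {"name": "small-clusters-only", "definition": json.dumps(policy_definition)}
).raise_for_status()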
Azure Databricks supports SCIM, or System for Cross-domain Identity Management, an open standard that allows you to automate user provisioning using a REST API and JSON.
Using SCIM you can create users, groups and service principals in Databricks, give them the proper level of access, temporarily lock and unlock user accounts, and remove access.
POST /api/2.0/preview/scim/v2/ServicePrincipals HTTP/1.1
Host: <databricks-instance>
Authorization: Bearer dapi48…a6138b
Content-Type: application/scim+json

{
  "schemas": ["urn:ietf:params:scim:schemas:core:2.0:ServicePrincipal"],
  "applicationId": "b4647a57-063a-43e3-a6b4-c9a4e9f9f0b7",
  "displayName": "test-service-principal",
  "groups": [{"value": "123456"}],
  "entitlements": [{"value": "allow-cluster-create"}]
}
By default, all users have access to all data stored in a cluster’s managed tables. To limit that, one can use table access control to set permissions for data objects on a cluster. Also, one can create dedicated Hive views and apply Row Level Security. Read more about Table access control
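For example, a minimal sketch of granting and denying table access to a Databricks group on a table-ACL-enabled cluster (the table and group names are examples):

# Grant read access to a group and explicitly deny access to a sensitive table
spark.sql("GRANT SELECT ON TABLE dwh.customer TO `data-analysts`")
spark.sql("DENY SELECT ON TABLE dwh.customer_pii TO `data-analysts`")

# Verify what the group can do
spark.sql("SHOW GRANT `data-analysts` ON TABLE dwh.customer").show()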
First, setup permissions on Azure Data Lake Gen 2 using ACLs. Read more about Data Lake Gen 2 ACLs
Use Azure Active Directory credential passthrough to enable access for your users. Make sure the email used to log in to Databricks is the same one used when setting the ACLs. Read more about accessing data from ADLS using Azure AD
P.S. The Power BI Databricks connector supports credential passthrough too.
Don’t forget that you can mount a storage and still use credential passthrough - read more.
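A hedged sketch of such a mount, assuming placeholder container, storage account and mount point names:

# Mount an ADLS Gen2 container with Azure AD credential passthrough
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

dbutils.fs.mount(
    source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point = "/mnt/datalake",
    extra_configs = configs
)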
Be aware that credential passthrough has many limitations: Scala is disabled, Data Factory is not supported, DBFS is not available, and more.
As stated above, Data Factory does not work with credential passthrough. For long-running or frequent workloads and automated jobs, you still want to use system users (service principals).
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net", dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
Usernames and passwords are required to access sources like Azure SQL, SQL Server, MySQL, PostgreSQL. Key Vault is a good place to store such credentials.
jdbcUrl = "jdbc:mysql://{0}:{1}/{2}".format(jdbcHostname, jdbcPort, jdbcDatabase)connectionProperties = {"user" : jdbcUsername,"password" : jdbcPassword,"driver" : "com.mysql.jdbc.Driver"}
Other storage services, like BigQuery, might require key-based authentication. Cluster-level environment variables do the trick:
GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/service-account-file.json"
And that is enough to access your data within the cluster.
df = spark.read \
  .format("bigquery") \
  .load("bigquery-public-data.samples.shakespeare")
Default Databricks governance controls might not be sufficient for large enterprises. You might check out COTS tools, like Immuta, Privacera, Okera. Take a look at my exploration of Immuta.
Always enable diagnostic logging for Databricks and other services. It delivers plenty of valuable information out of the box and allows custom logging too.
Information groups logged by Databricks:
Information groups logged by Azure Data Lake:
Databricks provides table and database search functionality, but it isn’t enough for complete data discovery and documentation.
Azure Purview is an Apache Atlas based Azure data catalog, but it’s still in early phases of development.
If you look for an enterprise ready data catalog, take a look at Alation:
Read more about various data governance tools.
Getting automated Spark lineage delivered to your data catalog is tough, but not impossible. Here are lineage examples with the Spline library - article 1, article 2.
Another option is to push manual lineage, visualized here.
Also, consider using Azure Data Factory for orchestration. With a proper ADF pipeline structure, you get high-level lineage that helps you see dependencies and rerun failed activities. You can read more about the Data Factory dependency framework here.
By default, Databricks doesn’t provide any data quality features. The good news is that there are great packages available, for example Great Expectations and Deequ.
Also, there might be no need for any fancy data quality libraries. Simple asserts can help you control data better.
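A minimal sketch of such checks in a notebook, with example table and column names:

# Simple data quality checks with plain asserts
df = spark.table("dwh.customer")

assert df.count() > 0, "dwh.customer is empty"
assert df.filter(df.id.isNull()).count() == 0, "dwh.customer contains NULL ids"
assert df.select("id").distinct().count() == df.count(), "dwh.customer has duplicate ids"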
Below are a few examples of GDPR and CCPA compliance measures. For a complete Azure Databricks GDPR compliance guide, look here.
Delta Lake supports several statements to facilitate deleting data from and updating data in Delta tables - read more.
MERGE INTO users
USING opted_out_users
ON opted_out_users.userId = users.userId
WHEN MATCHED THEN DELETE
By default, Delta Lake retains table history for 30 days and makes it available for “time travel” and rollbacks. That means that, even after you have deleted personal information from a Delta table, users in your organization may still be able to view that historical data.
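To illustrate, a hedged sketch of how that history stays queryable through Delta time travel (using the gdpr.customers table from the example below):

# Table history and older versions remain queryable until VACUUM removes them
spark.sql("DESCRIBE HISTORY gdpr.customers").show()
spark.sql("SELECT * FROM gdpr.customers VERSION AS OF 0").show()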
To delete all customers who requested that their information be deleted, and then remove all table history older than 7 days, you simply run:
VACUUM gdpr.customers
Unfortunately, there are no built-in data anonymization or encryption patterns. It’s up to you to implement one based on the privacy pattern that suits your scenario best:
Example - load data from a source through a SQL view that limits sensitive data
CREATE VIEW masked.dwh_customer AS
SELECT
  id,
  (CASE WHEN email IS NULL THEN 0 ELSE 1 END) AS hasEmail,
  postalCode,
  customerType
FROM dwh.customer
Example - drop sensitive columns with Apache Spark
df = spark.read.parquet("/mnt/data/customer")
df = df.drop("nin", "address", "phoneNumber")
A further improvement would be to get a list of sensitive columns for each dataset and skip them automatically.
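A minimal sketch of that idea, assuming you maintain a (hypothetical) mapping of sensitive columns per dataset:

# Hypothetical mapping of sensitive columns per dataset
sensitive_columns = {
    "customer": ["nin", "address", "phoneNumber"],
    "order": ["deliveryAddress"]
}

def read_masked(dataset):
    df = spark.read.parquet("/mnt/data/" + dataset)
    # Drop whatever sensitive columns are registered for this dataset
    return df.drop(*sensitive_columns.get(dataset, []))

customers = read_masked("customer")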
When records and datasets need to stay linked, hash the PII instead of removing it.
Example - hash sensitive column
UPDATE gdpr.customers_lookup SET c_email_address_pseudonym = sha2(c_email_address,256)
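The same pseudonymization expressed in PySpark could look like this (column names follow the SQL example above):

from pyspark.sql.functions import sha2, col

# Replace the e-mail address with a SHA-256 pseudonym
customers = spark.table("gdpr.customers_lookup")
customers = customers \
    .withColumn("c_email_address_pseudonym", sha2(col("c_email_address"), 256)) \
    .drop("c_email_address")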
Get familiar with the pricing and the cluster differences. Probably the biggest mistake you can make is to use all-purpose compute for all your activities.
If you use more than one cluster, add cluster tags so you can identify your workloads and filter them in Azure Cost Management.
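As a sketch, tags can be set in the cluster definition, for example when creating clusters through the Clusters API (cluster settings, tag keys and values are examples):

# Fragment of a cluster definition with custom tags visible in Azure Cost Management
cluster_spec = {
    "cluster_name": "etl-nightly",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 2,
    "custom_tags": {"team": "data-platform", "workload": "nightly-etl"}
}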
Make sure to add security elements (e.g. NSG, ExpressRoute). Here is an Azure Databricks - Bring Your Own VNet reference architecture:
Read more about secure cluster connectivity.
As you can see, setting up data governance in Databricks is not straightforward. There are many moving parts that require custom implementation.