Data Lake Governance with Tagging in Databricks Unity Catalog

The goal of Databricks Unity Catalog is to provide centralized security and management for data and AI assets across the data lakehouse. Unity Catalog provides fine-grained access control for all the securable objects in the lakehouse: databases, tables, files and even models. Gone are the limitations of the Hive metastore. The Unity Catalog metastore manages all data and AI assets across different workspaces and storage locations. Providing this level of access control substantially increases the quality of governance while reducing the workload involved. Tagging offers an additional target of opportunity.

Tagging Overview

Tags are metadata elements structured as key-value pairs that can be attached to any asset in the lakehouse. Tagging can make these assets more searchable, manageable and governable. A well-structured, well-executed tagging strategy can enhance data classification, enable regulatory compliance and streamline data lifecycle management. The first step is to identify a use case that could serve as a Proof of Value in your organization. A well-structured tagging strategy requires buy-in and participation from multiple stakeholders, including technical resources, SMEs and a sponsor. These are five common use cases for tagging that might find some traction in a regulated enterprise because they can usually be piggy-backed onto an existing or upcoming initiative:

  • Data Classification and Security
  • Data Lifecycle Management
  • Data Cataloging and Discovery
  • Compliance and Regulation
  • Project Management and Collaboration

Data Classification and Security

There is always room for an additional mechanism to help safely manage PII (personally identifiable information). A basic initial implementation of tagging could be as simple as applying a PII tag to classify data based on sensitivity. These tags can then be integrated with access control policies in Unity Catalog to automatically grant or restrict access to sensitive data. Balancing the promise of data access in the lakehouse with the regulatory realities surrounding sensitive data is always difficult. Additional tools are always welcome here.
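
As a minimal sketch of what that first step might look like in a notebook (the `spark` session is the one Databricks provides, and the catalog, schema, table and column names here are illustrative assumptions rather than anything from a real environment):

```python
# Minimal sketch: classify a table and one of its columns as PII.
# The names main.hr.employees and ssn are illustrative assumptions.

# Tag the whole table as containing personally identifiable information
spark.sql("ALTER TABLE main.hr.employees SET TAGS ('classification' = 'pii')")

# Tag an individual column for finer-grained classification
spark.sql(
    "ALTER TABLE main.hr.employees ALTER COLUMN ssn "
    "SET TAGS ('classification' = 'pii')"
)
```

Once tags like these are applied consistently, they become the hook that access control policies and audits can key off of.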

Data Lifecycle Management

Some organizations struggle with the concept of managing different environments in Databricks. This is particularly true when they are moving from a data landscape where there were specific servers for each environment. Tags can be used to identify stages (e.g., dev, test, and prod). These tags can then be leveraged to implement policies and practices around moving data through different lifecycle stages. For example, masking policies or transformation steps may differ between environments. Tags can also be used to facilitate rules around the deliberate destruction of sensitive data. Geo-coding data with tags to comply with European regulations is another possible target of opportunity.
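
As a hedged sketch, stamping lifecycle-stage tags at the schema level might look like this (the schema names and the `environment` tag key are assumptions for illustration):

```python
# Sketch: tag schemas with their lifecycle stage so downstream policies
# (masking rules, retention, promotion steps) can key off the tag.
# The schema names and the 'environment' tag key are assumptions.
for schema_name, stage in [("main.sales_dev", "dev"),
                           ("main.sales_test", "test"),
                           ("main.sales", "prod")]:
    spark.sql(f"ALTER SCHEMA {schema_name} SET TAGS ('environment' = '{stage}')")
```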

Data Cataloging and Discovery

There can be a benefit in attaching descriptive tags directly to the data for cataloging and discovery even if you are already using an external tool. Adding descriptive tags like ‘customer’ or ‘marketing’ directly to the data assets themselves makes it more convenient for analysts and data scientists to perform searches, and therefore makes the catalog more likely to actually be used.

Compliance and Regulation

This is related to, and can be used in conjunction with, data classification and security. Applying tags such as ‘GDPR’ or ‘HIPAA’ can make performing audits for regulators much simpler, and these tags can be used in conjunction with security tags. In an increasingly regulated data environment, it pays to make your data assets easy to regulate.
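
A hedged example of what an audit query might look like, assuming tables have been tagged with a compliance key and values such as ‘GDPR’ (the catalog name and the tag naming convention are assumptions):

```python
# Sketch: list every table in a catalog that carries a GDPR compliance tag.
# The catalog name ('main') and the tag naming convention are assumptions.
gdpr_tables = spark.sql("""
    SELECT catalog_name, schema_name, table_name, tag_name, tag_value
    FROM main.information_schema.table_tags
    WHERE tag_name = 'compliance' AND tag_value = 'GDPR'
""")
gdpr_tables.show(truncate=False)
```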

Project Management and Collaboration

This tagging strategy can be used to organize data assets based on projects, teams or departments. This can facilitate project management and improve collaboration by identifying which organizational unit owns or is working with a particular data asset.

Implementation

There are some practical considerations when implementing a tagging program:

  • each securable object has a limit of twenty tags
  • the maximum length of a tag is 255 characters, with no special characters allowed
  • you can only search for tags using an exact match (pattern matching would have been nice here)

A well-executed tagging strategy will involve some level of automation. It is possible to manage tags in the Catalog Explorer, and this can be a good way to kick the tires in the very beginning, but automation is critical for consistent, comprehensive application of the tagging strategy. Good governance is automated. While tagging is available for all securable objects, you will likely start out applying tags to tables.

Tag information can be queried from the information schema views. For managing tags, Databricks Runtime 13.3 and above supports SQL commands, which is the preferred mechanism because it is much easier than working with the information schema directly. Regardless of the mechanism used, a user must have the APPLY TAG privilege on the object, the USE SCHEMA privilege on the object’s parent schema and the USE CATALOG privilege on the object’s parent catalog. This is typical of Unity Catalog’s three-tiered hierarchy. If you are using SQL commands to manage tags, you can use the SET TAGS and UNSET TAGS clauses of the ALTER TABLE command.
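
As a sketch of those mechanics, run from a notebook with the privileges described above (the table and group names are assumptions):

```python
# Sketch: grant tagging rights, attach tags, then remove one.
# The table name and the data_stewards principal are assumptions.

# Allow a stewardship group to manage tags on this table
spark.sql("GRANT APPLY TAG ON TABLE main.sales.orders TO data_stewards")

# Attach key-value tags to the table
spark.sql(
    "ALTER TABLE main.sales.orders "
    "SET TAGS ('owner' = 'sales_engineering', 'environment' = 'prod')"
)

# Remove a tag that no longer applies
spark.sql("ALTER TABLE main.sales.orders UNSET TAGS ('environment')")
```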

You can use a fairly straightforward PySpark script to loop through a set of tables, look for a certain set of column names and then apply tags as appropriate. This can be done as an initial one-time run and then automated, either by creating a distinct job that checks for new tables and columns or by including it in existing ingestion processes. There is a lot to be gained by evolving this pipeline from a script that simply checks for columns named ‘ssn’ to an ML job that looks for fields whose contents resemble social security numbers.
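
A minimal PySpark sketch along those lines, assuming a notebook-provided `spark` session and using an illustrative catalog, schema and list of sensitive column names:

```python
# Sketch: scan every table in one schema, look for sensitive column names,
# and tag the table and the matching columns as PII.
# The catalog/schema names and the column list are illustrative assumptions.

SENSITIVE_COLUMNS = {"ssn", "social_security_number", "national_id"}
CATALOG, SCHEMA = "main", "hr"

table_names = [row.tableName
               for row in spark.sql(f"SHOW TABLES IN {CATALOG}.{SCHEMA}").collect()]

for table_name in table_names:
    full_name = f"{CATALOG}.{SCHEMA}.{table_name}"
    columns = [field.name.lower() for field in spark.table(full_name).schema.fields]
    matches = SENSITIVE_COLUMNS.intersection(columns)
    if matches:
        # Tag the table itself, then each sensitive column
        spark.sql(f"ALTER TABLE {full_name} SET TAGS ('classification' = 'pii')")
        for column in matches:
            spark.sql(
                f"ALTER TABLE {full_name} ALTER COLUMN {column} "
                "SET TAGS ('classification' = 'pii')"
            )
```

Scheduling a job like this, or folding it into the ingestion pipeline, keeps newly created tables and columns covered by the same policy.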

Conclusion

I’ve seen a lot of companies struggle with populating their Databricks Lakehouse with sensitive data. In their prior state, each database had a very limited set of users, so only people who were authorized to see certain data, like PII, had access to the database that stored it. However, the utility of a lakehouse is dramatically reduced if you don’t allow sensitive data; in most cases, it just won’t get any enterprise traction. Leveraging all of the governance and security features of Unity Catalog is a great, if not mandatory, first step. Enhancing governance, security and utility with tagging is probably going to be necessary to one degree or another for your organization to get broad usage and acceptance.

Contact us to learn more about how to build robustly governed solutions in Databricks for your organization.


