Seth Rao
CEO at FirstEigen
Simpler Data Access and Controls With Unity Catalog
Foreword:
The blog post below is reproduced on our website with permission from Speedboat.pro, as it closely intertwines with FirstEigen’s DataBuck philosophy around building well-architected lakehouses.
When building data pipelines, a thorough upfront validation of the data set (I call it ‘defensive programming’) yields great rewards in pipeline reliability and operational resilience. That said, hand-coding DQ rules is a non-trivial task. This is where FirstEigen’s DataBuck can apply sophisticated AI and ML techniques to monitor your data automatically. The best part: these DQ metrics are published into Unity Catalog, so all consumers of the data can access data quality stats as metadata and make intelligent, informed decisions about whether to use that data.
Databricks users get a free autonomous data validation add-on
Why Unity Catalog Is the Backbone of Data Access Control in Databricks Lakehouses
The core premise of the lakehouse is that all your data is stored in reasonably priced, infinitely scalable object stores such as S3, ADLS, or Google Cloud Storage, while all data processing happens on reasonably priced, infinitely scalable VM-based clusters on elastic compute clouds – the classic separation of storage and compute.
One of the friction points in adopting the lakehouse on Databricks has been the multitude of choices (and their attendant tradeoffs) for putting access controls around sensitive data in the above-mentioned object stores. Over the years, many patterns for secure data access control, each with its own pros and cons, have come and gone. With Unity Catalog (UC), this complexity has been removed. We now have a simple, elegant, and secure pattern for working with data on the lakehouse.
See the comparison diagram below to understand the considerations for each role and how each data access and control option stacks up.
Our take: We love how UC has simplified and unified data access and governance. We recommend you adopt Unity Catalog without hesitation for your data estate.
Comparison of the options to access data in the lakehouse
Understanding context: The path to Unity Catalog
For history buffs like us, here is a quick chronological walkthrough of the data access options offered on Databricks over the years.
The Evolution of Access Control Patterns on Databricks
Pattern #1 – DBFS Mounts
Per Databricks, DBFS mounts can be used to mount cloud storage to DBFS to simplify data access for users who are unfamiliar with cloud concepts. (See official documentation)
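To make the mechanics concrete, here is a minimal sketch of the mount pattern (for illustration only: the bucket and mount-point names are hypothetical, and the cluster is assumed to already hold storage credentials such as an instance profile):

```python
# Minimal sketch of the (now discouraged) DBFS mount pattern.
# Bucket and mount-point names are hypothetical.
dbutils.fs.mount(
    source="s3a://acme-raw-data",   # cloud object store to expose
    mount_point="/mnt/raw",         # path visible to every user in the workspace
)

# Any user in the workspace can now read (and write) the mounted data --
# there is no per-user access control on /mnt/raw.
display(dbutils.fs.ls("/mnt/raw"))
df = spark.read.format("delta").load("/mnt/raw/customers")
```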
- PRO: Super simple. Easy for beginners to get started on Databricks
- CON: No access control at all. Data on DBFS can be read and written by all users in the workspace. This includes all object stores mounted on DBFS.
- CON: No audit support
Our take: DBFS mounts are an insecure and obsolete pattern. ABSOLUTELY DO NOT USE.
Pattern #2 – Connecting to Object Stores Using Azure/AWS Credentials
In this approach, the user passes in credentials (OAuth 2.0 tokens, SAS tokens, or account keys) through Spark properties. This approach is usually paired with the Secrets API to ensure secrets do not leak. (See official documentation)
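As a rough sketch of what this looks like for ADLS Gen2 with OAuth 2.0 (the storage account, secret scope, and key names below are hypothetical), note how much boilerplate sits in front of the actual read:

```python
# Sketch of the credential-in-Spark-config pattern for ADLS Gen2 (OAuth 2.0).
# Storage account, secret scope, and key names are hypothetical.
storage_account = "acmestorageacct"

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
# Pull the service principal's id/secret from a Databricks secret scope so keys
# never appear in plain text in the notebook.
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="acme-scope", key="sp-client-id"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="acme-scope", key="sp-client-secret"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
)

# Only after all of the above can the data actually be read.
df = spark.read.format("delta").load(
    f"abfss://raw@{storage_account}.dfs.core.windows.net/customers"
)
```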
- CON: Code looks daunting to new users. Mistakes could result in leakage of keys.
- CON: Supporting row and column level access is complicated.
Our take: While this approach gets the job done (barely), it is anything but simple. USE ONLY WHEN NECESSARY.
Pattern #3 – Table ACLs Against Hive Metastores
Admins can grant and revoke access to objects in their workspace’s Hive metastore using Python and SQL. (See official documentation).
The hive metastore in question can be the default hive metastore (every workspace ships with one) or an external metastore (which needs to be explicitly configured by workspace admins).
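A minimal sketch of the pattern (table, view, and group names are hypothetical): dataset-level grants use standard SQL, and row/column restrictions are layered on with dynamic views:

```python
# Hypothetical table, view, and group names; run by a workspace admin.

# Dataset-level access: standard SQL GRANT/REVOKE against the Hive metastore.
spark.sql("GRANT SELECT ON TABLE sales.orders TO `analysts`")
spark.sql("REVOKE SELECT ON TABLE sales.orders FROM `interns`")

# Row/column-level access: a dynamic view that masks a column and filters rows
# based on the caller's group membership.
spark.sql("""
  CREATE OR REPLACE VIEW sales.orders_restricted AS
  SELECT
    order_id,
    region,
    CASE WHEN is_member('finance') THEN amount ELSE NULL END AS amount
  FROM sales.orders
  WHERE is_member('admins') OR region = 'US'
""")
spark.sql("GRANT SELECT ON VIEW sales.orders_restricted TO `analysts`")
```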
- PRO: Supports data access controls at dataset levels
- PRO: Row and column level access controls can be implemented through dynamic views
- PRO: Standard SQL syntax for grants and revokes
- CON: No centralized support when working with workspace-local hive metastores. Each workspace needs to be managed separately.
- CON: No audit support.
Our take: Table ACLs offered the best solution – until Unity Catalog came around. This pattern is now officially considered a legacy pattern. DO NOT USE.
Pattern #4 – Credential Passthrough
In Databricks’ own words, credential passthrough allows you to authenticate automatically to S3 buckets from Databricks clusters using the identity that you use to log in to Databricks.
Unlike option #2 above, the user does not need to supply any credentials in their code. (See official documentation)
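With passthrough enabled on the cluster, the read itself becomes a one-liner. A sketch, with a hypothetical bucket name:

```python
# Sketch of a read on a credential-passthrough cluster (bucket name hypothetical).
# No keys, tokens, or Spark conf needed: the user's own Databricks login identity
# is used to authorize the request against the object store.
df = spark.read.format("delta").load("s3a://acme-raw-data/customers")
display(df)
```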
- PRO: Supports data access controls at a dataset level.
- PRO: Super simple for new users. Clean code.
- CON: Does not work with table ACL. No support for row and column level controls.
- CON: Pushes ACL and audit burden on object store admin.
Our take: Despite its limitations, it was the best option for a while. This pattern is now officially considered a legacy pattern. DO NOT USE.
The Unity Catalog Advantage: Centralized, Secure, and Simple
UC was created to resolve the sprawl of workspace-local hive metastores/catalogs. Moving all these catalogs to the account level enables centralized access control, auditing, lineage, and data discovery capabilities across all Databricks workspaces. (See official documentation)
It takes the best features from all the options above and packages them into one architectural pattern.
We like working with UC because it simplifies and unifies how developers access their data. All data access will now be a read from a UC table or view. All writes will go into a UC table.
Another thing we really like is how UC lifts the burden of credential handling off developers. All credentials are now handled by UC admins (who, by nature, will be senior and experienced resources). Users trying to access data only have to present their own identity to UC. UC, always the mediator, takes care of the rest of the complexity, reveals to the user only the data they have access to, and logs every one of these access requests into audit logs.
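As a sketch of the day-to-day experience (catalog, schema, table, and group names below are hypothetical): a UC admin grants privileges once, and developers simply read and write through the three-level namespace while UC enforces permissions and records audit logs:

```python
# Hypothetical catalog/schema/table and group names.

# One-time setup by a UC admin (not the data consumer):
spark.sql("GRANT USE CATALOG ON CATALOG main TO `data-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data-analysts`")
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data-analysts`")

# Day-to-day developer code: no credentials, no storage paths -- just the
# three-level namespace. UC checks permissions and writes an audit log entry.
orders = spark.read.table("main.sales.orders")

daily = orders.groupBy("order_date").count()
daily.write.mode("overwrite").saveAsTable("main.sales.daily_order_counts")
```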
On pricing – Unity Catalog is completely FREE to users on the Premium and Enterprise Databricks tiers. (Remember: Databricks only bills for compute time on clusters and SQL Warehouses. All other goodies – UC, Delta Lake, MLflow, et al. – are provided at no charge.)
But wait, there’s more… The benefits of UC do not stop there (we’ll delve into other facets in upcoming posts). Expect to see much more from Databricks (GenAI integration aka LakehouseIQ, Auto-management of data estates and more) in the coming quarters.
Conclusion
If you already use Databricks, we strongly recommend that you incorporate Unity Catalog as the foundation of your data estate.
If you are still deciding on your lakehouse architecture, then our recommendation is to put Databricks and Unity Catalog on your evaluation list.
Ready to Strengthen Your Data Governance with Automated Quality Controls?
Unity Catalog simplifies access management, and FirstEigen’s DataBuck takes it a step further by automating data validation across your Databricks Lakehouse. Achieve greater data accuracy and trust with AI-driven monitoring that integrates seamlessly with Unity Catalog.
Learn More About DataBuck and see how it can elevate your data governance to new levels of reliability and efficiency.
Check out these articles on Data Trustability, Observability & Data Quality Management:
- Enterprise Data Catalog Tools
- Data Observability Tools Comparison
- Data Warehouse Architecture
- What is Enterprise Data Management?
- Synapse Azure Data Factory
- Data Warehouse Issues
- Automated Data Quality Management
- Data Lake vs Data Warehouse vs Data Mart
- Cloud Leaks
- Data Quality Platform Alation Integrations
FAQs
What is Unity Catalog?
Unity Catalog is Databricks’ solution for centralized data access control, offering unified management of data access, discovery, auditing, and lineage across all workspaces. It simplifies secure access, helping organizations govern data consistently and efficiently.

How does Unity Catalog improve on the legacy access control patterns?
Unity Catalog consolidates the best features of legacy methods (like table ACLs and credential passthrough) while adding centralized management, fine-grained access control, and unified auditing across all Databricks workspaces.

Does Unity Catalog support row- and column-level access controls?
Yes, Unity Catalog enables row- and column-level access controls, allowing more granular control over data and helping ensure that only authorized users can access specific portions of the data.

How much does Unity Catalog cost?
Unity Catalog is included for Premium and Enterprise Databricks users at no additional cost. Databricks only charges for the compute time used by clusters and SQL warehouses.

Does DataBuck integrate with Unity Catalog?
Yes, DataBuck integrates with Unity Catalog to provide automated data validation. DQ metrics from DataBuck are stored in Unity Catalog, making it easy for users to evaluate data quality directly within their Databricks environment.

How does Unity Catalog support auditing and compliance?
Unity Catalog logs all data access requests, which supports compliance requirements and enhances transparency for monitoring data use. Audit logs help organizations track data usage patterns and maintain secure data practices.

How do I migrate from legacy patterns to Unity Catalog?
Databricks provides tools and guidance for migrating to Unity Catalog, allowing admins to transition easily from older methods (like DBFS mounts or Hive metastore ACLs) to a centralized UC system.

Does Unity Catalog work with the rest of the Databricks platform?
Yes, Unity Catalog is designed to work seamlessly with Databricks’ broader suite, including GenAI features like LakehouseIQ, which will further enhance data accessibility and automated management.

Does Unity Catalog simplify credential management for developers?
Absolutely. Unity Catalog handles all credentials through UC administrators, freeing developers from the responsibility of credential management, reducing errors, and improving security across the board.

What is on the roadmap for Unity Catalog?
Databricks plans ongoing enhancements, including tighter integrations with GenAI, expanded support for automated data estate management, and new tools to optimize data access and governance within the Lakehouse ecosystem.