The below blog post is being reproduced on our website with permission from Speedboat.pro as it closely intertwines with FirstEigen’s DataBuck philosophy around building well-architected lakehouses.
When building data pipelines, a thorough validation of the data set upfront (I call it ‘defensive programming’) yields great rewards in terms of pipeline reliability and operational resilience. That said, hand-coding DQ rules is a non-trivial task. This is where FirstEigen’s DataBuck can apply sophisticated AI and ML techniques to automatically monitor your data for you. The best part – these DQ metrics are published into Unity Catalog so that all consumers of this data can access data quality stats as metadata and make intelligent and informed decisions on whether to use that data or not.
The core premise of the Lakehouse is that all your data is stored in reasonably priced and infinitely scalable object stores such as S3, ADLS, or Google Cloud Storage, while all of the data processing happens on reasonably priced and infinitely scalable VM-based clusters on elastic compute clouds – the classic separation of storage and compute.
One of the friction points with adopting lakehouse on Databricks has been the multitude of choices (and their attendant tradeoffs) in putting access controls around sensitive data on above-mentioned object stores. Over the years, many patterns, each with their own pros and cons, have come and gone for optimal and secure data access control. With Unity Catalog (UC), this complexity has been removed. We now have a simple, elegant, and secure pattern to work with data on the Lakehouse.
See the comparison diagram below to understand the considerations for each role and see how each data access and control situation stacks up.
Our take: We love how UC has simplified and unified data access and governance. We recommend you adopt Unity Catalog without hesitation for your data estate.
Comparison of the options to access data in the lakehouse
Understanding context: The path to Unity Catalog
For history-buffs like us, here is a quick walkthrough of the data access options, ordered chronologically, on Databricks over the years.
Pattern #1 – DBFS Mounts
Per Databricks, DBFS Mounts can be used to mount cloud storage to DBFS to simplify data access for users that are unfamiliar with cloud concepts. (See official documentation)
- PRO: Super simple. Easy for beginners to get started on Databricks
- CON: No access control at all. Data on DBFS can be read and written by all users in the workspace. This includes all object stores mounted on DBFS.
- CON: No audit support
Our take: DBFS mounts are an insecure and obsolete pattern. ABSOLUTELY DO NOT USE.
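For historical context only (the verdict above stands), here is roughly what the pattern looked like in a notebook. The bucket and mount-point names are hypothetical examples; `dbutils.fs.mount` is the documented Databricks mount call.

```python
# Pattern #1 sketch -- for historical context only (see "DO NOT USE" above).
# The bucket and mount point below are hypothetical examples.

source = "s3a://my-demo-bucket"   # cloud object store to expose
mount_point = "/mnt/demo"         # DBFS path every workspace user will see

# On a Databricks cluster, a single call exposes the whole bucket to ALL
# workspace users -- there is no per-user access control:
#   dbutils.fs.mount(source=source, mount_point=mount_point)
# Any notebook can then read it, e.g.:
#   spark.read.text("/mnt/demo/raw/events.txt")
```

Once mounted, the object store is indistinguishable from any other DBFS path, which is exactly why there is no way to restrict who reads it.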
Pattern #2 – Connecting to object stores using Azure/AWS credentials
In this approach, the user passes in credentials (OAuth 2.0 tokens, SAS tokens, or account keys) through Spark properties. This approach is usually paired with the Secrets API to ensure secrets do not leak. (See official documentation)
- CON: Code looks daunting to new users. Mistakes could result in leakage of keys.
- CON: Supporting row and column level access is complicated.
Our take: While this approach gets the job done (barely), it is anything but simple. USE ONLY WHEN NECESSARY.
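To make the "daunting code" point concrete, here is a sketch of the Spark properties involved for an ADLS Gen2 account using OAuth 2.0. The storage account, client ID, and tenant ID below are hypothetical; the `fs.azure.*` keys are the documented ABFS OAuth settings.

```python
# Pattern #2 sketch: wiring OAuth 2.0 credentials into Spark properties.
# The storage account, client id, and tenant id used here are hypothetical.

def adls_oauth_conf(storage_account: str, client_id: str,
                    client_secret: str, tenant_id: str) -> dict:
    """Build the Spark conf entries for OAuth access to an ADLS Gen2 account."""
    host = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }

# On Databricks, each entry is applied with spark.conf.set(key, value), and
# client_secret should come from dbutils.secrets.get(), never a literal.
conf = adls_oauth_conf("mystorageacct", "app-client-id",
                       "<fetched-from-secret-scope>", "my-tenant-id")
```

Five conf keys, a service principal, and a secret scope just to read one storage account – and a typo in any of them can silently fall back to an unauthenticated path or leak a key into notebook history.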
Pattern #3 – Table ACLs against Hive metastores
Admins can grant and revoke access to objects in their workspace’s Hive metastore using Python and SQL. (See official documentation).
The Hive metastore in question can be the workspace's default metastore (every workspace ships with one) or an external metastore (which must be explicitly configured by workspace admins).
- PRO: Supports data access controls at dataset levels
- PRO: Row and column level access controls can be implemented through dynamic views
- PRO: Standard SQL syntax for grants and revokes
- CON: No centralized support when working with workspace-local hive metastores. Each workspace needs to be managed separately.
- CON: No audit support.
Our take: Table ACLs offered the best solution – until Unity Catalog came around. This pattern is now officially considered a legacy pattern. DO NOT USE.
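As a sketch of what the grants and dynamic views looked like in practice: the table, group, and column names below are hypothetical, while the `GRANT` syntax and the `is_member()` function are standard Databricks SQL. The statements are held in Python strings as they would be passed to `spark.sql()`.

```python
# Pattern #3 sketch: a table ACL grant plus a dynamic view for column-level
# control (hypothetical names). On Databricks each statement would be
# executed via spark.sql(...).

grant_stmt = "GRANT SELECT ON TABLE sales.transactions TO `analysts`"

# Dynamic view: reveal the amount column only to members of the `finance`
# group; everyone else sees NULL.
dynamic_view = """
CREATE VIEW sales.transactions_redacted AS
SELECT
  order_id,
  CASE WHEN is_member('finance') THEN amount ELSE NULL END AS amount
FROM sales.transactions
"""
```

Note the governance gap this pattern leaves: the grant above lives only in this one workspace's metastore and must be repeated in every other workspace that mounts the same data.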
Pattern #4 – Credential Passthrough
In Databricks’ own words, credential passthrough allows you to authenticate automatically to S3 buckets from Databricks clusters using the identity that you use to log in to Databricks.
Unlike option #2 above, the user does not need to supply any credentials in their code. (See official documentation)
- PRO: Supports data access controls at a dataset level.
- PRO: Super simple for new users. Clean code.
- CON: Does not work with table ACL. No support for row and column level controls.
- CON: Pushes ACL and audit burden on object store admin.
Our take: Despite its limitations, it was the best option for a while. This pattern is now officially considered a legacy pattern. DO NOT USE.
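For contrast with pattern #2, here is what the "clean code" looked like under passthrough. The bucket path below is a hypothetical example.

```python
# Pattern #4 sketch: on a passthrough-enabled cluster, no keys appear in
# code -- the user's own cloud identity authorizes the read.
# The path below is a hypothetical example.

path = "s3a://my-demo-bucket/raw/events"

# On Databricks: df = spark.read.format("delta").load(path)
# Authorization happens at the object store, using the identity the user
# logged in to Databricks with -- which also means the ACLs and audit trail
# live with the object store admin, not with Databricks.
```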
Unity Catalog (UC) – Some more color
UC was created to resolve the sprawl of workspace-local hive metastores/catalogs. Moving all these catalogs to the account level enables centralized access control, auditing, lineage, and data discovery capabilities across all Databricks workspaces. (See official documentation)
It takes the best features from all the options above and packages them into one architectural pattern.
We like working with UC because it simplifies and unifies how developers access their data. All data access will now be a read from a UC table or view. All writes will go into a UC table.
Another thing we really like is how UC removes the burden of credential handling from developers. All credentials are now handled by UC admins (who, by nature, will be senior and experienced engineers). Users trying to access data only have to authenticate to UC with their own identity. UC, always the mediator, takes care of the rest of the complexity, reveals to the user only the data they have access to, and logs all of these access requests into audit logs along the way.
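Putting those two points together, day-to-day data access under UC reduces to plain three-level-namespace SQL. The catalog, schema, table, and group names below are hypothetical; the statements are standard Databricks SQL, held in Python strings as they would be passed to `spark.sql()`.

```python
# UC sketch: reads, writes, and grants all go through catalog.schema.table
# names (hypothetical below) -- no storage credentials appear in user code.

read_stmt = "SELECT * FROM main.sales.transactions"
write_stmt = ("INSERT INTO main.sales.transactions "
              "SELECT * FROM main.staging.new_transactions")
grant_stmt = "GRANT SELECT ON TABLE main.sales.transactions TO `data-analysts`"

# On Databricks:
#   df = spark.table("main.sales.transactions")   # user reads a UC table
#   spark.sql(grant_stmt)                         # admin grants access
# Every such request is mediated by UC and recorded in the audit logs.
```

Compare this with the five `fs.azure.*` properties from pattern #2: the credential plumbing has moved entirely behind the catalog.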
On pricing – Unity Catalog is completely FREE to users on the Premium and Enterprise tiers of Databricks. (Remember: Databricks only bills for compute time on clusters and SQL Warehouses. All other goodies – UC, Delta Lake, MLflow, et al. – are provided at no charge.)
But wait, there’s more… The benefits of UC do not stop there (we’ll delve into other facets in upcoming posts). Expect to see much more from Databricks (GenAI integration aka LakehouseIQ, Auto-management of data estates and more) in the coming quarters.
If you already use Databricks, we strongly recommend that you incorporate Unity Catalog as the foundation of your data estate.
If you are still deciding on your lakehouse architecture, then our recommendation is to put Databricks and Unity Catalog on your evaluation list.