Understanding Iceberg support in the Databricks ecosystem

Explore how Databricks integrates Iceberg support via Unity Catalog and UniForm, enabling seamless data management within Spark-based environments.
September 30, 2024

In many ways, Spark is responsible for the creation of and excitement around Iceberg. As a result, Spark has the best support in the Iceberg ecosystem. And given that Databricks is largely built on and around Spark, it’s not surprising that it also has great support for working with Iceberg and data lakehouse architectures generally.

Unity Catalog, which is now a core part of any Databricks deployment, provides much of the platform’s governance capabilities, but it can also now natively act as an Iceberg catalog. All you need to do is get Iceberg tables into your Databricks workspace. Enter UniForm.

By default in Databricks, data is stored in the Delta Lake table format, a competitor to Iceberg. That means your Databricks workspace today is filled with Delta Lake-formatted tables. UniForm can automatically generate Iceberg metadata alongside the Delta Lake metadata, so you don’t need to convert or create Iceberg-specific tables within Databricks. You just need to opt tables into UniForm by setting the following two table properties. At the moment, this needs to be set on a per-table basis:

'delta.enableIcebergCompatV2' = 'true'
'delta.universalFormat.enabledFormats' = 'iceberg'
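
In a Databricks notebook, enabling UniForm on an existing Delta table might look something like the following minimal sketch; the catalog, schema and table names are placeholders:

# Hypothetical example: opt an existing Delta table into UniForm from a Databricks notebook
spark.sql("""
    ALTER TABLE my_catalog.my_schema.my_table
    SET TBLPROPERTIES (
      'delta.enableIcebergCompatV2' = 'true',
      'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")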

UniForm compatibility isn’t perfect yet. For example, it doesn’t work if your existing table has deletion vectors enabled, so enabling it for a particular table may require a few additional steps, sketched below.
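
If deletion vectors are the blocker, those extra steps boil down to disabling them and purging any files that still carry them before opting into UniForm. A rough sketch on a hypothetical table (check the Databricks docs for the exact procedure):

# Hypothetical example: remove deletion vectors before enabling UniForm
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false')")
spark.sql("REORG TABLE my_table APPLY (PURGE)")  # rewrite files so no deletion vectors remain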

UniForm / Unity Catalog doesn’t seem to support Iceberg views yet. You can set these table properties on a view, but SHOW VIEWS returns a “Catalog unity does not support views” error in Spark.

Once you’ve set that up on your tables, you can access them directly using the Iceberg REST catalog API for Unity Catalog (and Databricks launched OAuth support in the time it took me to write this article).

To connect from Spark, create a new REST catalog connection. Databricks doesn’t support the Iceberg-Access-Delegation header (yet; it looks like it’s coming), so you’ll also need to provide credentials for the underlying object storage (in this case, S3).
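
Here’s a rough sketch of what that Spark configuration might look like; the endpoint path, token and S3 credential property names are my assumptions, so check the current Unity Catalog and Iceberg documentation for exact values:

# Sketch: external Spark session pointed at Unity Catalog's Iceberg REST endpoint.
# Replace the angle-bracket placeholders with values for your workspace.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,"
            "org.apache.iceberg:iceberg-aws-bundle:1.6.1")  # adjust versions as needed
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.unity", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.unity.type", "rest")
    .config("spark.sql.catalog.unity.uri",
            "https://<workspace-host>/api/2.1/unity-catalog/iceberg")
    .config("spark.sql.catalog.unity.token", "<databricks-access-token>")
    # No access delegation yet, so pass object storage credentials directly (S3 here)
    .config("spark.sql.catalog.unity.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.unity.s3.access-key-id", "<aws-access-key-id>")
    .config("spark.sql.catalog.unity.s3.secret-access-key", "<aws-secret-access-key>")
    .config("spark.sql.catalog.unity.client.region", "<aws-region>")
    .getOrCreate()
)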

Once you’re connected, your Databricks tables and schemas will be mounted under the unity catalog in Spark. Querying is straightforward. Note that writing is not supported at this time.

spark.sql('SELECT * FROM unity.databricks_catalog.my_table').show()

One interesting quirk: querying nested namespaces (which a default Databricks Unity Catalog setup will have) requires a bit of fancy escaping around the namespaces (the names between the top-level Spark catalog and the table/view):

It turns out that the Iceberg REST catalog uses the unit separator character to delimit namespace levels. Databricks’ API does not like this and throws an error, likely for fundamental reasons. Escaping the nested namespace with backticks instead sends the . character to the Databricks API raw, which works. It shows that the standard is early but moving really quickly; expect a fix for this issue to roll out soon.
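
In practice, the workaround looks something like this (hypothetical schema and table names):

# Hypothetical example: backtick-quote the nested namespace so the "." is sent through literally
spark.sql('SELECT * FROM unity.`databricks_catalog.my_schema`.my_table').show()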

What about the Tabular acquisition?

If you don’t follow tech industry news, you may not have seen Databricks’ monster acquisition of Tabular, the Iceberg data catalog company, back in June 2024.

It’s early days, and it remains to be seen how Tabular’s catalog and table format expertise will be applied to Databricks’ similar offerings. Tabular itself is no longer accepting new signups, so unless you were already a user, Tabular is probably not an option you need to worry about.

But don’t ignore it completely. Tabular still hosts some of the best Iceberg documentation on the internet today. We’ll keep a close eye on how the Databricks and Tabular teams work together.

