The importance of open table formats for modern data lakes

Open table formats offer enterprises the capabilities of data warehouses and the flexibility of data lakes.
April 7, 2025

Open table formats are reshaping how enterprises interact with data lakes. These formats, most notably Delta Lake and Apache Iceberg, bring much-needed ACID compliance, schema evolution, and data versioning to traditionally schema-less, append-only data lakes. Despite growing adoption, the industry has yet to settle on a single technical standard, echoing previous format wars.

This ongoing competition offers both opportunities and challenges for those looking to build scalable, future-proof data architectures in pursuit of advanced analytics, including AI.

What are open table formats and why do they matter?

Open table formats are a layer of abstraction that wraps around data files, organizing them into a database-like structure. Their capabilities transform traditional data lakes into true data lakehouses, blending the best of data warehouses and data lakes through the following (a brief code sketch follows the list):

  • ACID transactions: Guaranteeing data integrity and validity in the midst of concurrent workloads and potential errors
  • Schema evolution: Allowing schema modifications without breaking existing queries and reducing the need to rewrite data files – historically a costly and time-consuming operation for structured data on the data lake
  • Time travel and versioning: Enabling rollback or historical data analysis
  • Partitioning and performance optimizations: Enhancing query speed and efficiency for relational data stored on a data lake
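
To make these capabilities concrete, here is a minimal sketch using PySpark with the delta-spark package. The table path /tmp/events and the sample rows are illustrative assumptions, not taken from a real deployment.

```python
# A minimal sketch of ACID writes, schema evolution, and time travel with
# Delta Lake. Assumes PySpark plus the delta-spark package are installed;
# the path /tmp/events and the sample rows are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("open-table-format-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Each write is an atomic, ACID-compliant commit (this one creates version 0).
spark.createDataFrame([(1, "alice")], ["id", "name"]) \
    .write.format("delta").save("/tmp/events")

# Schema evolution: append a new column without rewriting existing data files.
spark.createDataFrame([(2, "bob", "US")], ["id", "name", "country"]) \
    .write.format("delta").mode("append") \
    .option("mergeSchema", "true").save("/tmp/events")

# Time travel: read the table exactly as it looked at version 0.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events").show()
```

Apache Iceberg exposes equivalent capabilities through its own APIs; the underlying mechanics differ, as described below.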

Traditionally, data lakes have provided inexpensive storage for vast amounts of structured and unstructured data. However, their lack of transaction support and governance mechanisms has led to significant challenges in maintaining data integrity, especially as regulatory compliance becomes increasingly stringent and organizations increasingly pursue advanced analytics and AI.

Delta Lake vs. Apache Iceberg

Today, two major contenders in the open table format space stand out: Delta Lake, developed by Databricks, and Apache Iceberg, an open-source project originally developed at Netflix. Both provide robust transactional support and performance enhancements, but they differ in key areas.

Delta Lake evolved closely alongside Databricks and Spark and is highly optimized for both, gaining traction among enterprise customers thanks to its strong performance and support for real-time use cases.

Under the hood, Delta Lake uses a transaction log (_delta_log) to record actions like inserts, deletes, and schema changes. The log is append-only and provides a linear history of changes, making time travel and rollback straightforward to implement. Owing to this linear, file-based log, querying Delta tables does not require a technical catalog (metastore). Data governance, however, still requires a catalog. Unity Catalog, in particular, is optimized for the broader Databricks ecosystem, although Delta Lake can also be combined with other data catalogs.
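
As a hedged illustration of that structure, the sketch below lists the commit files of the hypothetical /tmp/events table from the earlier example; each numbered JSON file in _delta_log records one commit's actions.

```python
# Peek at Delta Lake's append-only transaction log on disk. Assumes the
# hypothetical /tmp/events table from the earlier sketch already exists.
import json
import os

log_dir = "/tmp/events/_delta_log"
for name in sorted(os.listdir(log_dir)):
    if not name.endswith(".json"):
        continue  # skip checkpoint parquet and .crc files
    with open(os.path.join(log_dir, name)) as f:
        # Each line is one JSON action object: commitInfo, metaData, add, remove, ...
        action_types = [next(iter(json.loads(line))) for line in f]
    print(name, action_types)
```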

Apache Iceberg is a table format initially created at Netflix that has since seen contributions from Apple, Amazon, Adobe, and many others. It enjoys broad ecosystem support, spanning query engines like Trino, Presto, and Flink as well as warehouse platforms like Snowflake, BigQuery, and Redshift Spectrum. This makes Iceberg a popular choice for companies looking for a broadly accessible format.

Under the hood, Iceberg organizes metadata in a tree-like structure of snapshots and manifests, which helps it scale efficiently as a table grows. This makes it well-suited for petabyte-scale datasets and high-concurrency environments. Schema evolution happens in place through metadata file updates, rather than by rewriting the data files themselves. However, because Iceberg’s table state is not kept in a single linear log, the format requires a technical catalog for querying: the catalog stores a pointer to the table’s latest metadata file so that engines can resolve the correct state of the table. As with Delta Lake, Apache Iceberg can also be combined with a wide range of technical catalogs for governance.
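
As a rough sketch of how this looks from a client, the snippet below uses the pyiceberg library to walk a table's snapshots and make a metadata-only schema change. The catalog name "demo" and table "db.events" are hypothetical placeholders, with connection details assumed to live in pyiceberg's configuration.

```python
# Walk Iceberg's snapshot metadata and perform an in-place schema change.
# Catalog "demo" and table "db.events" are hypothetical placeholders.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog("demo")            # endpoint/credentials come from config
table = catalog.load_table("db.events")

# The catalog points at the latest metadata file; each snapshot in that
# metadata references a manifest list, which in turn references manifests.
for snap in table.snapshots():
    print(snap.snapshot_id, snap.timestamp_ms, snap.manifest_list)

# Schema evolution is a metadata-only commit; no data files are rewritten.
with table.update_schema() as update:
    update.add_column("country", StringType())
```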

Despite their respective strengths, neither format has emerged as an undisputed leader. As of April 2025, the Delta Lake GitHub repository shows 7.9k stars and 364 contributors, while Apache Iceberg’s shows 7.1k stars and 572 contributors – effectively neck and neck.

The risks posed by the format wars

For teams that intend to build data lakes with open table formats, the lack of a clear market leader presents some risks: 

  • Interoperability concerns: Choosing one format over another may limit compatibility with certain data processing engines, catalogs, cloud providers, or analytics tools, leading to potential vendor lock-in.
  • Future-proofing data architectures: Investing heavily in a single format may pose migration challenges down the road, especially in the (somewhat unlikely) event one format eventually runs the other out of business.
  • Operational complexity: Different business units within the same enterprise may manage multiple architectures with multiple open table formats, increasing maintenance overhead and the need for specialized expertise.

Fivetran’s approach: Interoperability without complexity

Fivetran’s approach to fully managed data lakes mitigates these risks by writing to both formats, substantially increasing the number of query engines that can operate on the same data without duplicating it or running time-consuming format conversions on larger tables.

The Fivetran Managed Data Lake Service features automated data integration, writing data seamlessly to both Delta Lake and Apache Iceberg. This approach provides enterprises with:

  • Interoperability: A unified pipeline supports multiple query engines (e.g., Spark, Flink, Trino, Presto) and cloud providers.
  • Optionality: You can choose the appropriate format based on evolving needs, without being locked into a single vendor.
  • Simplicity: Through automated data integration, Fivetran handles the complexities of converting data from disparate sources into the open table format of your choice and ongoing schema evolution, reducing the engineering overhead of data teams.

With Fivetran’s approach, organizations gain the ability to support their data lakehouse with an automated, fully managed service without gambling on the outcome of the format wars.

The road ahead for open table formats

Open table formats will play a critical role in shaping the future of data lakes and data catalogs. While the industry has yet to declare a definitive winner between Delta Lake and Apache Iceberg, businesses don’t have to wait for a resolution.

By leveraging solutions like the Fivetran Managed Data Lake Service, companies can enjoy the best of both worlds—ensuring ACID transactions, scalability, and governance without adding unnecessary complexity. Organizations that prioritize flexibility and interoperability will be best positioned for long-term success in the era of open table formats.

[CTA_MODULE]

Experience Fivetran Managed Data Lake Service for yourself.
Sign up