
Apache Iceberg: The open table format for modern data lakes

June 1, 2026
Learn how the Apache Iceberg open table format enables ACID transactions, schema evolution, and multi-engine interoperability with Fivetran.

For a decade, architecture owners have been trapped in a costly cycle: buy a new compute engine, migrate data into its proprietary format, and get locked in, only to do it all over again when formats change.

Apache Iceberg breaks this cycle by unlocking the separation of storage from compute. It’s the open infrastructure layer that enables multi-engine access, ACID reliability, and true interoperability across your data lake and downstream engines.

Learn how Iceberg’s metadata layer works and why it’s the first step to building a data architecture that scales without vendor constraints.

What is Apache Iceberg?

Apache Iceberg is an open table format for large-scale analytics tables. It sits above the raw data files in a data lake and enables:

  • ACID transactions: Guarantees data reliability during simultaneous read/write operations.
  • Schema evolution: Allows you to add, drop, or rename columns without rewriting underlying data.
  • Time travel: Enables querying historical data snapshots for auditing or rollback.
  • Multi-engine access on object storage: Lets different compute engines, like Spark or Trino, query the same data simultaneously.

To understand the value of Iceberg, you must first understand the concept of a table format. A table format is the metadata layer that explicitly defines a table’s schema, tracks its history, and maps exactly which physical files belong to it at any given moment. This metadata layer allows teams to run SQL queries directly on object storage (like Google Cloud Storage) without moving the data into a traditional database.

About 78% of data professionals use the Iceberg table format exclusively. It’s been adopted natively across Spark, Trino, Flink, Presto, and Hive, making it the de facto standard for the modern data lakehouse.

Before Iceberg, data teams relied on older, Hive-style table definitions. Hive tracked tables using a directory-based approach: if a file was in a specific folder path (like year=2024/month=10), the system assumed it belonged to that table. This approach was brittle because it couldn’t guarantee ACID compliance, and simultaneous read/write operations frequently resulted in data corruption or inaccurate queries.

Netflix created the Iceberg format around 2018 specifically to address these Hive limitations at scale. They needed a format that tracked data at the file level rather than the directory level. Netflix later donated the project to the Apache Software Foundation. Today, Iceberg is a community-governed open-source project under the Apache 2.0 license.

How the Apache Iceberg architecture works: Deep dive

Iceberg’s utility comes from its separation of concerns. Instead of letting the compute engine dictate how data is stored, Iceberg acts as a stand-alone metadata manager, organizing data into three distinct layers. This architecture lets multiple engines read and write simultaneously without stepping on each other, and it keeps any single component from becoming a bottleneck.

Here’s how the three layers interact.

| Layer | Primary function | Key component |
|---|---|---|
| Catalog | The entry point and absolute source of truth for the table’s current state. | Pointer to the active metadata file. |
| Metadata | Tracks schema, partition rules, snapshots, and file-level statistics for pruning. | Metadata files, manifest lists, and manifest files. |
| Data | The physical storage of the actual records in object storage. | Parquet, ORC, or Avro files. |

Catalog layer

The catalog layer is the absolute entry point for any Iceberg query. When an engine like Spark or Trino needs to access an Iceberg table, it first asks the catalog for the current location of the metadata pointer — a single file path that acts as the source of truth for the table’s current state. 

By relying on this single pointer, Iceberg ensures atomic transactions that prevent incomplete or inaccurate data from ever reaching analytics. For example, if a pipeline is writing 10 million new customer records and the cluster crashes at 9 million, the pointer simply never updates. Those 9 million partially written records remain entirely invisible to downstream dashboards, preventing analysts from reporting inaccurate revenue numbers.
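The crash scenario above can be sketched in a few lines of Python. This is a toy model, not Iceberg's implementation: `ToyCatalog`, `commit_batch`, the `crash_at` parameter, and the file names are all hypothetical, and a real catalog performs the pointer update as an atomic operation in an external service.

```python
class ToyCatalog:
    """Hypothetical in-memory catalog holding one table's metadata pointer."""

    def __init__(self):
        self.pointer = "v0.metadata.json"
        # metadata path -> data files that version of the table can see
        self.metadata = {"v0.metadata.json": []}

    def visible_files(self):
        # Readers only ever follow the current pointer.
        return list(self.metadata[self.pointer])


def commit_batch(catalog, new_files, crash_at=None):
    """Stage data files, then publish them with a single pointer update."""
    staged = []
    for i, name in enumerate(new_files):
        if crash_at is not None and i == crash_at:
            # Simulated failure: staged files exist but are never referenced.
            raise RuntimeError("writer crashed before commit")
        staged.append(name)
    new_path = f"v{len(catalog.metadata)}.metadata.json"
    catalog.metadata[new_path] = catalog.visible_files() + staged
    catalog.pointer = new_path  # the only step that makes the write visible


catalog = ToyCatalog()
commit_batch(catalog, ["a.parquet", "b.parquet"])
try:
    commit_batch(catalog, ["c.parquet", "d.parquet", "e.parquet"], crash_at=2)
except RuntimeError:
    pass
files_after_crash = catalog.visible_files()
print(files_after_crash)
```

Because the failed writer never reaches the pointer update, readers keep seeing only the files from the last successful commit.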

Metadata layer

Once the catalog provides the pointer, the compute engine reads the metadata layer, which is structured as a hierarchical tree of files. At the very top is the metadata file itself, which stores the table’s schema, partition specifications, and snapshots. Below that sit the manifest lists, which point to individual manifest files. 

These manifest files track the exact paths of the physical data files and store column-level statistics like minimum and maximum values. Engines use these statistics to aggressively “prune” (or skip over) files during a query, which cuts compute costs significantly.
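As a rough illustration of pruning, the sketch below keeps only the files whose min/max range overlaps a query filter. The `ManifestEntry` class, the `order_date` column, and the file paths are assumptions made for this example; real manifests store richer statistics in Avro files.

```python
from dataclasses import dataclass


@dataclass
class ManifestEntry:
    """Hypothetical manifest record: a data file plus stats for one column."""
    path: str
    min_order_date: str  # ISO dates compare correctly as strings
    max_order_date: str


def prune(entries, lo, hi):
    """Return paths of files whose [min, max] range overlaps [lo, hi]."""
    return [
        e.path
        for e in entries
        if e.max_order_date >= lo and e.min_order_date <= hi
    ]


entries = [
    ManifestEntry("data/f1.parquet", "2024-01-01", "2024-01-31"),
    ManifestEntry("data/f2.parquet", "2024-02-01", "2024-02-29"),
    ManifestEntry("data/f3.parquet", "2024-03-01", "2024-03-31"),
]

# A query filtering on 2024-02-10..2024-02-20 only needs to open one file.
to_scan = prune(entries, "2024-02-10", "2024-02-20")
print(to_scan)
```

The engine decides which files to skip from metadata alone, before reading any data file.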

Data layer

The data layer is where the actual records reside. Iceberg doesn’t dictate a proprietary file format for this layer — instead, it stores data in standard, open-source formats like Apache Parquet, ORC, or Avro, directly within your object storage. 

Because the metadata layer tracks everything precisely at the file level, you can optimize your data pipelines with Apache Iceberg without ever moving or rewriting the underlying physical files.

Key features of Apache Iceberg

Iceberg shifts how organizations manage modern data at scale. It takes the capabilities that were previously only available in monolithic, proprietary data warehouses and brings them directly to the open data lake, giving architecture owners the flexibility they need to scale. 

Here are some of the features that make it possible. 

ACID transactions at scale

In a traditional data lake environment, simultaneous writes frequently cause data corruption. If a query runs while a separate job is writing new files, the end user might see partial, duplicated, or inaccurate data. 

Iceberg solves this problem using optimistic concurrency control, so multiple writers can operate on the exact same table simultaneously. If two writers attempt to modify the same data at the same time, Iceberg automatically detects the conflict and forces the second writer to retry, guaranteeing strict ACID compliance across your entire architecture.
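A minimal sketch of optimistic concurrency, assuming a toy table with a version counter: each writer commits against the version it read, and a stale writer refreshes and retries. `ToyTable` and `commit_with_retry` are hypothetical names, and the retry here is simplified; a real writer would also check whether its changes still apply cleanly to the refreshed table state before retrying.

```python
class ToyTable:
    """Hypothetical table whose commits succeed only against the latest version."""

    def __init__(self):
        self.version = 0
        self.rows = []

    def try_commit(self, base_version, new_rows):
        # Compare-and-swap: reject the commit if someone else committed first.
        if base_version != self.version:
            return False
        self.rows = self.rows + new_rows
        self.version += 1
        return True


def commit_with_retry(table, new_rows, base_version, max_attempts=3):
    """Try to commit; on conflict, refresh the base version and retry."""
    for attempt in range(1, max_attempts + 1):
        if table.try_commit(base_version, new_rows):
            return attempt
        base_version = table.version  # refresh to the latest committed state
    raise RuntimeError("gave up after repeated conflicts")


table = ToyTable()
stale = table.version  # both writers start from version 0
attempts_a = commit_with_retry(table, ["row-from-A"], stale)
attempts_b = commit_with_retry(table, ["row-from-B"], stale)
print(attempts_a, attempts_b, table.rows)
```

No writer holds a lock while preparing its changes; a conflict only costs the losing writer an extra commit attempt.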

Schema evolution

Enterprise data schemas change constantly as columns are added, renamed, or dropped. Iceberg handles schema evolution entirely within its metadata layer by assigning a unique ID to every column. If someone renames a column, the ID stays the same. If they drop a column and add a new one with the same name, Iceberg assigns a new ID. This tracking mechanism prevents the old “zombie data” from reappearing and allows teams to evolve schemas instantly without touching the physical data files.
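The ID-based tracking can be sketched as follows, assuming stored rows are keyed by column ID rather than by name. `ToySchema`, `read_row`, and the sample columns are hypothetical and only model the idea.

```python
class ToySchema:
    """Hypothetical schema that maps column names to stable numeric IDs."""

    def __init__(self):
        self._next_id = 1
        self.columns = {}  # column name -> column ID

    def add(self, name):
        self.columns[name] = self._next_id
        self._next_id += 1  # IDs are never reused

    def rename(self, old, new):
        self.columns[new] = self.columns.pop(old)  # same ID, new name

    def drop(self, name):
        del self.columns[name]


def read_row(schema, stored_row):
    """Project a stored row (keyed by column ID) through the current schema."""
    return {name: stored_row.get(col_id) for name, col_id in schema.columns.items()}


schema = ToySchema()
schema.add("id")
schema.add("email")
stored = {1: 42, 2: "a@example.com"}  # written under the original schema

schema.rename("email", "contact_email")
after_rename = read_row(schema, stored)  # old values follow the rename

schema.drop("contact_email")
schema.add("contact_email")  # same name, brand-new ID
after_readd = read_row(schema, stored)  # old values do not resurface

print(after_rename)
print(after_readd)
```

The stored row is never modified; only the name-to-ID mapping changes, which is why these operations are metadata-only.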

Partition evolution and hidden partitioning

Partitioning data by date or region significantly speeds up queries, but business requirements often evolve. If you started by partitioning monthly and now need to partition by day, older formats would force you to rewrite all historical data to match the new layout.

Iceberg, by contrast, supports partition evolution, meaning you can change the partition layout going forward while still querying both the old and new layouts together. It also uses hidden partitioning: partition values are derived automatically from column values (for example, the day of a timestamp), so users don’t need to add extra partition filters to their SQL to hit the correct partitions.
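To make both ideas concrete, here is a toy scan planner in which each file remembers the partition transform it was written under. The `TRANSFORMS` table, the file records, and `plan_scan` are assumptions for illustration; Iceberg's real transforms and partition specs are encoded differently.

```python
from datetime import datetime

# Hypothetical transforms that derive a partition value from a timestamp column.
TRANSFORMS = {
    "month": lambda ts: ts.strftime("%Y-%m"),
    "day": lambda ts: ts.strftime("%Y-%m-%d"),
}

# Older files were written under a monthly layout, newer ones under a daily one.
files = [
    {"path": "f1.parquet", "transform": "month", "value": "2024-09"},
    {"path": "f2.parquet", "transform": "month", "value": "2024-10"},
    {"path": "f3.parquet", "transform": "day", "value": "2024-11-02"},
    {"path": "f4.parquet", "transform": "day", "value": "2024-11-03"},
]


def plan_scan(files, ts):
    """Pick files whose partition value matches ts under each file's own layout."""
    return [
        f["path"]
        for f in files
        if f["value"] == TRANSFORMS[f["transform"]](ts)
    ]


# The caller filters on the timestamp only and never names a partition column.
october_hit = plan_scan(files, datetime(2024, 10, 15))
november_hit = plan_scan(files, datetime(2024, 11, 3))
print(october_hit, november_hit)
```

The same filter works across both layouts because the planner applies each file's own transform, so historical files never need to be rewritten.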

Time travel and rollback

Suppose a bad pipeline run corrupts your table. Because Iceberg creates a new, distinct snapshot for every change made to a table, it maintains a complete and auditable history. Teams can query the table exactly as it looked at a specific timestamp or snapshot ID, then roll the table back to the last good snapshot to undo the bad run. This time travel capability also helps debug complex pipelines or reproduce the exact dataset used to train an AI model.
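A simplified model of snapshots, time travel, and rollback might look like this. `ToyHistory`, its method names, and the integer timestamps are hypothetical; the sketch assumes snapshot IDs increase with commit time, which real Iceberg does not guarantee.

```python
class ToyHistory:
    """Hypothetical snapshot log: every commit keeps its own copy of the rows."""

    def __init__(self):
        self.snapshots = {}  # snapshot ID -> (commit time, rows)
        self.current_id = None

    def commit(self, committed_at, rows):
        snapshot_id = len(self.snapshots) + 1
        self.snapshots[snapshot_id] = (committed_at, list(rows))
        self.current_id = snapshot_id
        return snapshot_id

    def rows_at_snapshot(self, snapshot_id):
        return list(self.snapshots[snapshot_id][1])

    def rows_as_of(self, when):
        """Time travel: return the latest snapshot committed at or before `when`."""
        eligible = [sid for sid, (ts, _) in self.snapshots.items() if ts <= when]
        if not eligible:
            raise LookupError("no snapshot at or before that time")
        return self.rows_at_snapshot(max(eligible))

    def rollback_to(self, snapshot_id):
        """Rollback: repoint the table at an earlier snapshot; nothing is deleted."""
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id


history = ToyHistory()
good = history.commit(100, ["order-1", "order-2"])
bad = history.commit(200, ["order-1", "CORRUPT"])

as_of_150 = history.rows_as_of(150)  # what the table looked like before the bad run
history.rollback_to(good)
current_rows = history.rows_at_snapshot(history.current_id)
print(as_of_150, current_rows)
```

The bad snapshot stays in the history after the rollback, so it remains available for debugging.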

Multi-engine interoperability

Historically, data gravity meant architecture owners had to run all their workloads on the specific platform where the data lived. In Iceberg, the metadata is standardized and open, so you can use the best compute engine for each specific job.

You might run heavy batch transformations with Spark and serve ad hoc analytics with Trino, with both hitting the same Iceberg table simultaneously. This interoperability is why open table formats work so well for modern data lakes.

Copy-on-Write (CoW) vs. Merge-on-Read (MoR)

Updating or deleting records in a traditional data lake is notoriously slow and resource-intensive. Iceberg offers two distinct strategies to handle these operations efficiently. 

CoW rewrites the entire data file whenever a record is updated. This slows down write speeds but guarantees extremely fast read performance because updated files are ready for immediate scanning.

MoR writes the updates to a separate “delete file.” This makes write operations incredibly fast, but slows the read process as the compute engine must merge the base data files with the delete files dynamically during query execution. 

Architecture owners can configure this setting at the table level, optimizing performance based on the specific read/write ratio of the workload rather than forcing a single strategy across all jobs.
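The trade-off between the two strategies can be illustrated with a toy delete, assuming data files are plain lists of row IDs and the delete file is a set. All function names here are hypothetical, and real delete files are more structured than a set of IDs.

```python
def cow_delete(data_files, row_id):
    """Copy-on-Write: rewrite every file that contains the row; count rewrites."""
    new_files, rewritten = [], 0
    for rows in data_files:
        if row_id in rows:
            new_files.append([r for r in rows if r != row_id])
            rewritten += 1
        else:
            new_files.append(rows)  # untouched files are reused as-is
    return new_files, rewritten


def cow_read(data_files):
    """CoW read: files are already final, so just scan them."""
    return [r for rows in data_files for r in rows]


def mor_delete(delete_file, row_id):
    """Merge-on-Read: record the delete separately; data files stay untouched."""
    return delete_file | {row_id}


def mor_read(data_files, delete_file):
    """MoR read: filter out deleted rows while scanning."""
    return [r for rows in data_files for r in rows if r not in delete_file]


data_files = [[1, 2, 3], [4, 5, 6]]

cow_files, rewritten = cow_delete(data_files, 5)  # write pays the cost
deletes = mor_delete(set(), 5)                    # write is cheap

cow_result = cow_read(cow_files)
mor_result = mor_read(data_files, deletes)        # read pays the cost
print(cow_result, mor_result, rewritten)
```

Both strategies return the same rows; they differ only in whether the extra work happens at write time or at read time.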

Apache Iceberg vs. Delta Lake vs. Apache Hudi

The open table format space currently has three major open-source projects: Apache Iceberg, Delta Lake, and Apache Hudi.

While all three aim to bring ACID transactions and warehouse-like features to the data lake, their origins, governance models, and architectural philosophies differ significantly. 

If you’re trying to avoid vendor lock-in, here’s what you need to know.

| Feature | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Origin | Netflix | Databricks | Uber |
| Governance | Apache Software Foundation (open community) | Linux Foundation (historically Databricks-led) | Apache Software Foundation (open community) |
| Primary use case | Multi-engine analytics and extreme scale | Deep integration with Spark and Databricks ecosystem | Streaming data and fast upserts |
| Schema evolution | Full (in-place updates via ID tracking) | Partial (some changes require rewriting data files) | Partial (historically tied to Avro schemas) |
| Engine interoperability | High (native support across Spark, Trino, Flink, Snowflake, etc.) | Moderate (heavily optimized for Spark, but others require external readers) | Moderate (strong Spark/Flink focus) |

Why Apache Iceberg’s neutrality makes it best for data architects

Delta Lake was developed by Databricks and remains deeply entwined with their commercial ecosystem. While it’s open source under the Linux Foundation, the development roadmap and primary optimizations are heavily skewed toward Spark and Databricks compute. If your entire architecture runs on Databricks, then Delta Lake is the path of least resistance. However, if you want true multi-engine flexibility, Delta Lake will be more of a liability.

Apache Hudi was built by Uber specifically to handle massive streams of real-time data and fast upserts. It excels at streaming use cases but is overly complex to configure and manage for general-purpose analytics.

Apache Iceberg was designed from the ground up for engine neutrality. And because it’s governed by the Apache Software Foundation and not by a single commercial vendor, it has seen the broadest adoption. Snowflake, AWS, Google Cloud, and Cloudera all provide native, first-class support for Iceberg tables. 

Iceberg’s neutrality is why it’s currently leading the market in planned adoption. It’s the only format that truly guarantees you won’t be locked into a specific compute engine three years from now.

Apache Iceberg and Open Data Infrastructure: Why the table format matters for vendor independence

The conversation around Apache Iceberg often gets bogged down in technical details like manifest files and partition evolution. But for an architecture owner, the value of Iceberg is strategic: It’s the foundational layer of Open Data Infrastructure (ODI). 

ODI is a design philosophy that prioritizes interoperability, open standards, and the separation of storage from compute. It directly challenges the traditional approach of proprietary data platforms, which locks data behind closed formats and vendor-specific engines. The openness matters now more than ever as AI systems increasingly need to access data from every corner of the enterprise.

Breaking the compute monopoly

When you store data in a proprietary format inside a cloud data warehouse, that vendor controls access. If you want to use a new AI training engine or a cheaper SQL engine to query data, you must first extract it, transform it, and load it into a new system. And you pay for everything: the data storage, the compute to move data, and the compute to query it. 

With Iceberg, you standardize the storage layer on an open table format and choose the best compute engine for each job, optimizing the compute layer. Now, if a vendor raises their compute prices — or a new, faster engine hits the market — you simply point the new engine at the existing Iceberg catalog.

ODI ensures your data remains an independent, accessible asset rather than a hostage to the compute provider.

AI-ready interoperability

AI agents and machine learning (ML) models require massive volumes of high-quality, reproducible data. Building complex, custom pipelines to extract this data from siloed operational databases or proprietary warehouses is slow and fragile.

By using Iceberg, you create a single, reliable source of truth. Data lands in the open lakehouse once, and any tool — whether it’s a BI dashboard or a Spark ML job — can query it simultaneously with full ACID guarantees. 

This interoperability ensures your architecture and lakehouses are AI-ready, allowing engineering teams to focus on training models rather than maintaining data pipelines.

How Fivetran supports Apache Iceberg

The ODI described above only delivers value if data actually flows into it reliably. For most engineering teams, this is where the strategy breaks down — building and maintaining custom ingestion pipelines for hundreds of SaaS applications and databases is a full-time job. 

Every new source requires a custom API integration, error handling logic, and ongoing maintenance as upstream schemas change. That engineering effort compounds quickly, and it pulls your best people away from the AI and analytics work that pushes the business forward.

Fivetran ensures your data is AI-ready without the added maintenance. With 750+ pre-built and fully managed connectors spanning databases, enterprise platforms, and event stream platforms, Fivetran Managed Data Lake Service delivers data directly into your Iceberg data lake with native format support. 

Plus, schema evolution is fully automated in the Fivetran Managed Data Lake Service. When you add a column or change a data type in the source system, Fivetran detects the update and automatically applies it to your Iceberg or Delta Lake tables. The service also manages and updates the underlying metadata and catalog entries for you, keeping downstream queries stable without manual fixes and enabling true ODI.

For database sources, Fivetran uses log-based change data capture to read only the incremental changes from transaction logs rather than running expensive full-table scans. Those changes are then merged directly into your Iceberg tables using the format’s native write capabilities. 

The result is a continuously fresh, production-ready data lake that requires zero pipeline maintenance from your engineering team. With Fivetran, your data stays current, schemas stay aligned, and engineers stay focused on developing high-impact AI models and analytics products.

Get started with Fivetran today.

FAQ

What is the difference between Apache Iceberg and a data lake?

A data lake is a storage environment, typically built on cloud object storage like Amazon S3 or Azure Data Lake Storage. On its own, a data lake has no understanding of schemas, tables, or transactions. Apache Iceberg supplies that missing intelligence with a table format layer that sits on top of the data lake and organizes raw files into structured, queryable tables. You don’t choose between the two. You use Iceberg to turn the data in your lake into actionable insights and information for AI systems to pull from.

Is Apache Iceberg replacing Delta Lake and Hudi?

Not officially, but the industry is converging around Iceberg. Databricks introduced Delta UniForm specifically to make Delta tables readable as Iceberg, and Snowflake and Dremio have both built native Iceberg support. Hudi still serves streaming-heavy use cases, but its adoption has plateaued. Fivetran delivers data in both Iceberg and Delta through the Fivetran Managed Data Lake Service, making it a strong addition to any modern data stack.

Does Apache Iceberg work with Snowflake and Databricks?

Yes. Snowflake supports Iceberg through external volumes, letting you register Iceberg tables in your own object storage and query them directly without copying data. Databricks uses Delta UniForm, which writes Delta tables that are simultaneously readable as Iceberg. The difference is that Snowflake external volumes keep data under your governance, while Databricks UniForm keeps Delta as the primary format. That distinction matters when designing for long-term portability.

What is Apache Iceberg and why does it matter for AI workloads?

AI models and agents demand data that’s accurate, versioned, and reproducible. Apache Iceberg facilitates that. For instance, Iceberg’s time travel feature lets data scientists query a table exactly as it existed at a specific point in time, which is critical for reproducing training runs and debugging model drift. Its open, multi-engine architecture allows specialized AI compute engines to process data directly in the lake rather than extracting it into proprietary systems. For RAG pipelines and ML feature stores, Iceberg provides the governed foundation that ensures every model works from the same trusted dataset.
