Get started with Iceberg without the brain freeze

How to build an interoperable storage layer on a data lake without ever learning Iceberg.
February 6, 2026

Apache Iceberg now powers modern lakehouses everywhere, enabling interoperable, multi-engine workflows, and finally making data lakes reliable enough for production. Every major cloud vendor supports it. Every data platform wants to integrate with it.

There's just one problem: learning the intricacies of Iceberg is hard.

You need to understand manifest files, snapshot management, catalog services, and compaction strategies. You need to run infrastructure, schedule maintenance jobs, and govern how teams access data. For most organizations, that learning curve is steep enough to kill the initiative before it delivers value.

Here's the good news: with Fivetran, you don't actually need to learn any of that.

You can build a fully interoperable storage layer on your data lake using Fivetran's Managed Data Lake Service. You'll get everything Iceberg offers, including ACID transactions, schema evolution, time travel, and multi-engine access, without worrying about its internals. We handle the complexity. You just use your data the way you always have, without the lock-in.

[CTA_MODULE]

What "interoperable" means

When your data lake is interoperable, you're not locked into any single query engine or vendor. Use your favorite Iceberg-compatible query engines:

  • DuckDB - Yes!
  • Trino - Absolutely!
  • Snowflake - You've got it!
  • BigQuery (with native metadata synchronization to BigQuery Metastore, query your Google Cloud Storage data directly using BigQuery) - Let's go!

Any number of Iceberg-compatible query engines are at your disposal. Match your engine to your workload, not your storage vendor’s preferences. Your data sits in open formats on standard cloud storage, and multiple tools can read it directly.

Compare that to a traditional data warehouse. Every read and write goes through one vendor's system. Want to use a different tool? You're either paying for an expensive integration, maintaining a separate pipeline and destination, or out of luck. Your data is trapped.

An interoperable lake flips that model. You land data once and use it everywhere. Analysts query the same source using Snowflake. Your data science team hits it from Spark notebooks. ML engineers access it through Databricks for model training. BI analysts connect Trino for ad-hoc reporting. Everyone reads the same underlying data, not disparate copies scattered across platforms.

The benefits compound quickly. No redundant pipelines means less code to maintain. No duplicate datasets means you're not paying to store the same information multiple times. And when you need to switch tools or add a new one, you don't start from scratch with another ingestion process.

This is what open table formats like Iceberg enable. The question is whether you want to build and operate that infrastructure yourself.

Why most teams shouldn't learn Iceberg

Iceberg adds a metadata layer on top of Parquet files that makes a folder of data behave like a database table. Query engines read the metadata first, then go directly to the files they need. No directory scanning. No partial reads. No consistency problems.

You get ACID transactions so writes are atomic. Schema evolution so you can change columns without breaking queries. Time travel through snapshots. Efficient updates and deletes. With proper configuration, queries are fast even on massive tables.
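
To make that concrete, here's roughly what time travel looks like from a Spark session that already has an Iceberg catalog configured (the catalog and table names below are illustrative; configuring Spark against a catalog is covered later in this post):

# Read the table as of a past snapshot or timestamp (Spark 3.3+ SQL syntax)
spark.sql("SELECT * FROM lake.sales.orders VERSION AS OF 4348919386221124").show()
spark.sql("SELECT * FROM lake.sales.orders TIMESTAMP AS OF '2026-01-15 00:00:00'").show()

# List the snapshots available for time travel via Iceberg's metadata tables
spark.sql("SELECT committed_at, snapshot_id, operation FROM lake.sales.orders.snapshots").show()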

The catch is that someone has to operate and maintain all of this.

You need a catalog service

Iceberg requires a metadata catalog to track table definitions and snapshots. This catalog becomes a critical dependency. If it goes down, nothing works. If it gets corrupted, you're in trouble. You need monitoring, backups, and probably multi-region redundancy for production workloads.

Data ingestion gets complicated 

You can't just write Parquet files to a folder and call it a table. You need to use Iceberg's APIs through Spark or Flink to ensure ACID compliance. Skip the proper writing methods and you corrupt your table state. Concurrent writes need to go through Iceberg's transaction layer or you get conflicts and data loss.

Maintenance is your problem 

As you append data, small files accumulate. You need to compact them periodically or queries slow down. Old snapshots pile up and consume storage. Orphaned files from failed writes waste space. Metadata grows and eventually impacts query planning. Iceberg has commands for all of this, but you're responsible for scheduling and monitoring them.
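
For a sense of what that DIY work involves, here is a sketch of the maintenance calls Iceberg exposes as Spark procedures; you would run something like this yourself on a recurring schedule (catalog and table names are illustrative):

# Compact small files into larger ones
spark.sql("CALL lake.system.rewrite_data_files(table => 'sales.orders')")

# Expire old snapshots to reclaim storage
spark.sql("CALL lake.system.expire_snapshots(table => 'sales.orders', older_than => TIMESTAMP '2026-01-01 00:00:00')")

# Remove files left behind by failed writes
spark.sql("CALL lake.system.remove_orphan_files(table => 'sales.orders')")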

Access governance matters 

Everyone querying your lake needs to go through the catalog. If someone bypasses it and points a tool directly at the storage bucket, they'll see inconsistent data. Preventing this requires education, access controls, and organizational buy-in.

None of this is impossible. Teams build DIY lakehouses every day. But most of the work has nothing to do with your actual business problems. It's infrastructure overhead.

This is exactly what Fivetran's Managed Data Lake Service eliminates.

How the fully managed approach works

Fivetran's service handles the lakehouse architecture so you don't have to. It manages ingestion, table format maintenance, catalog operations, and storage optimization. You point it at your sources and your cloud storage, and it builds a production-ready lakehouse.

No Iceberg expertise required.

Your data lands in open formats

Connect any of Fivetran's 700+ data connectors to feed the lake. These include databases like Postgres, MySQL, SQL Server, and MongoDB, SaaS applications like Salesforce, HubSpot, Zendesk, and Stripe, and files, events, and custom sources.

Fivetran handles extraction complexity. For databases, that means change data capture to identify new and modified rows without full table scans. For APIs, it means pagination, rate limiting, authentication refresh, and vendor-specific quirks. Everything syncs incrementally, moving only what's changed.

The service covers compute costs for ingestion. According to GigaOm research, this delivers 77-95% cost savings compared to ingesting through a warehouse's native loading mechanisms. Syncs can run as frequently as every minute for near-real-time data.

As data flows in, Fivetran writes it to your cloud storage as Parquet files. You own the storage, which means you own your data. The data sits in your AWS account, Azure subscription, or GCP project.

Currently supported destinations include:

  • Amazon S3 across 12 AWS regions including us-east-1, us-west-2, eu-central-1, eu-west-1, ap-southeast-1, and us-gov-west-1 for government workloads.
  • Azure Data Lake Storage Gen2 across 13 Azure regions including eastus2, westus3, westeurope, and australiaeast.
  • Google Cloud Storage across all Google Cloud regions. GCS support was added in April 2025.

A managed catalog you don't have to run

Every destination comes with a Fivetran Iceberg REST Catalog built on Apache Polaris (Incubating).

The catalog is provisioned automatically. Fivetran provides the endpoint URL, catalog name, client ID, and client secret through the dashboard. Query engines authenticate via OAuth 2.0.
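
For example, a Python client can connect to the catalog with the open-source pyiceberg library, using the credentials from the dashboard. This is a minimal sketch; the property names follow pyiceberg's REST catalog conventions, so confirm the exact values against Fivetran's setup guide:

from pyiceberg.catalog import load_catalog

# Values come from the Fivetran dashboard
catalog = load_catalog(
    "fivetran",
    **{
        "type": "rest",
        "uri": "<your-catalog-endpoint>",
        "warehouse": "<your-catalog-name>",
        "credential": "<client-id>:<client-secret>",  # OAuth 2.0 client credentials
    },
)

print(catalog.list_namespaces())
table = catalog.load_table("my_schema.my_table")
df = table.scan().to_arrow()  # reads Parquet data files directly from your bucket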

One key design decision: the catalog is read-only for your query engines. Fivetran writes and updates metadata. This sounds limiting, but it's protective. External tools can't accidentally corrupt table state, and the design prevents write conflicts and delayed syncs. Fivetran's goal is to make sure that your data lands in pristine condition, on time, in the bronze layer of your data lake. The catalog stays consistent: your single source of truth.

If you'd rather use your own catalog infrastructure, the service can publish to AWS Glue, BigLake metastore, and Databricks Unity Catalog in addition to the default Polaris catalog.

Maintenance is automatic

This is where the "without learning Iceberg" promise really delivers. All the maintenance tasks that would otherwise consume your engineering time run automatically in the background.

  • Compaction happens during ingestion. Small files get consolidated as data is written. No separate job to schedule.
  • Snapshot expiration runs daily based on a retention period you configure.
  • Orphan file cleanup runs every other week for files older than 7 days. These are leftover files from failed operations that waste storage.
  • Metadata file cleanup keeps the current version plus 3 previous versions. Older metadata gets deleted automatically.
  • Column statistics are gathered for query optimization. Tables with 200 or fewer columns get statistics for all columns. Larger tables get statistics for _fivetran_synced and primary key columns, plus history mode columns when history mode is enabled.

You don't schedule any of this. You don't monitor job failures. You don't tune compaction thresholds. The maintenance just happens.

Schema changes propagate without intervention

When Fivetran detects a new column in a source, it adds that column to the destination table automatically. The default behavior (ALLOW_ALL) syncs new columns without asking. You can change this via the API if you want more control (ALLOW_COLUMNS or BLOCK_ALL options are available).
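
As a sketch of what that API call looks like, here is one way to change the policy from Python. The endpoint and field names follow Fivetran's REST API for connector schema config; double-check them against the current API reference:

import requests

# Connector ID, API key, and API secret come from your Fivetran account
resp = requests.patch(
    "https://api.fivetran.com/v1/connectors/<connector-id>/schemas",
    auth=("<api-key>", "<api-secret>"),
    json={"schema_change_handling": "ALLOW_COLUMNS"},  # or ALLOW_ALL / BLOCK_ALL
)
resp.raise_for_status()
print(resp.json())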

When columns are removed from a source, they're soft-deleted in the destination. The column remains in the table structure, but future records contain NULL values. Historical data stays intact.

Data type changes use a type hierarchy with JSON and STRING as the largest types. When source types change, the destination column type updates to accommodate both old and new data without loss.

You don't write ALTER TABLE statements. You don't coordinate schema migrations. The service handles it.

Query your data from anywhere

Once the lake is running, you query it like any other database. Write SQL against table names. The catalog handles everything else.

This is the payoff for building an interoperable storage layer. The same data is accessible from whichever tool fits the job.

Snowflake

Snowflake reads Iceberg tables through a catalog integration and an external volume. Fivetran provides pre-configured SQL in the dashboard for creating both. Run those commands in Snowflake and the tables appear.

Fivetran configures tables with AUTO_REFRESH = TRUE and a 300-second refresh interval by default. New synced data becomes queryable within 5 minutes. You can adjust this from 30 seconds up to 86,400 seconds depending on latency requirements.

From the analyst's perspective, these Iceberg tables look like native Snowflake tables. Same SQL. Same joins. Same patterns.

BigQuery

For GCS destinations, Fivetran integrates with BigLake metastore. Grant the Fivetran service account the BigQuery Admin role and tables register automatically. You query through the standard BigQuery interface while data lives on GCS at object storage pricing.
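
Once registered, querying from Python looks the same as querying any other BigQuery table. A sketch using the google-cloud-bigquery client; the project, dataset, and table names are illustrative:

from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")

query = """
    SELECT status, COUNT(*) AS orders
    FROM `my-gcp-project.my_schema.my_table`
    GROUP BY status
"""
for row in client.query(query).result():
    print(row["status"], row["orders"])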

Amazon Athena and Redshift

Athena and Redshift Spectrum work through AWS Glue Data Catalog. Set up Glue integration and Fivetran registers tables automatically. Query from the Athena console or alongside native Redshift tables via Spectrum.

Spark

Apache Spark connects directly to the Iceberg REST Catalog. Configure a catalog in your Spark session and start querying immediately.

spark.conf.set("spark.sql.catalog.fivetran", 
"org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.fivetran.type", "rest")
spark.conf.set("spark.sql.catalog.fivetran.uri", "<your-catalog-url>")
# Plus OAuth credentials‍
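# e.g., the Iceberg REST catalog accepts OAuth 2.0 client credentials as a single
# property (confirm the exact settings against Fivetran's setup guide):
# spark.conf.set("spark.sql.catalog.fivetran.credential", "<client-id>:<client-secret>")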

spark.sql("SELECT * FROM fivetran.my_schema.my_table").show()

Spark retrieves metadata from the catalog and reads Parquet files directly. All of Iceberg's optimizations work automatically.

Databricks

Databricks works with both formats. Use Delta Lake tables through Unity Catalog or Iceberg tables through the REST Catalog. Fivetran is a launch partner for Databricks Managed Iceberg Tables, and joint customer adoption has grown 40% year-over-year.

Trino, Dremio, DuckDB, and many more

Trino and Dremio connect via catalog configuration pointing to the Polaris endpoint. DuckDB has experimental support for Iceberg REST catalogs, letting analysts query the lake directly from Python notebooks without spinning up infrastructure.

The pattern is the same everywhere: connect to the catalog, write SQL, get results. No Iceberg knowledge required.

Avoid these pitfalls

The following mistakes will trip you up, so avoid them.

Accessing Parquet files directly 

Don't point tools at the storage bucket. Always query through a catalog. Direct access gives you inconsistent results and bypasses all the optimizations that make Iceberg worthwhile.
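
In Spark terms, the difference looks like this (paths and names are illustrative):

# Don't: read Parquet files straight from the bucket -- you'll pick up uncommitted
# and deleted files, and you lose Iceberg's metadata-based pruning
df_bad = spark.read.parquet("s3://my-lake-bucket/my_schema/my_table/data/")

# Do: go through the catalog so the current table snapshot is resolved for you
df_good = spark.table("fivetran.my_schema.my_table")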

Manually deleting files

If you browse your storage bucket, you'll see metadata and manifest files. Don't delete them. Iceberg needs these files to function. Let automated maintenance handle cleanup.

Modifying destination tables directly

Fivetran maintains an internal schema representation. If you ALTER TABLE directly in your query engine, the internal state gets out of sync and future syncs may fail. Create views on top of base tables if you need derived columns.
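
For example, rather than altering the base table, a derived column can live in a view in your query engine. Sketched here as a Spark temporary view; the table and column names are illustrative:

# Leave the Fivetran-managed table untouched; add derived columns in a view
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW orders_enriched AS
    SELECT *,
           amount * 0.1 AS estimated_tax
    FROM fivetran.my_schema.orders
""")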

Putting storage in a different region than compute

Cross-region data transfer adds latency and cost. Put your lake bucket in the same region as your primary query engines.

Using incompatible storage tiers

Glacier, Deep Archive, and Archive storage classes don't work. The retrieval latency is incompatible with interactive queries. Stick to standard tiers.

Operational basics

Even with a fully managed service, you still need to configure and monitor a few things.

Snapshot retention 

Make sure your snapshot retention threshold matches your compliance requirements and time-travel needs. Longer retention gives more historical query capability but costs more storage. For most teams, 7-30 days is a reasonable starting point.

Monitor sync status 

Watch your sync statuses in the Fivetran dashboard. The service handles maintenance automatically, but you should still watch for sync failures. Set up alerting so you know when something needs attention.
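
If you want to wire that into your own alerting, the Fivetran REST API exposes connector status. A rough sketch follows; the endpoint and response fields should be verified against the current API docs:

import requests

resp = requests.get(
    "https://api.fivetran.com/v1/connectors/<connector-id>",
    auth=("<api-key>", "<api-secret>"),
)
resp.raise_for_status()
data = resp.json()["data"]

# Fields to watch, as documented for the connector details endpoint
print("setup_state:", data.get("status", {}).get("setup_state"))
print("sync_state:", data.get("status", {}).get("sync_state"))
print("last success:", data.get("succeeded_at"))
print("last failure:", data.get("failed_at"))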

Size your query engine appropriately 

The storage layer can handle significant load, but your query engine needs enough compute to read the data. Size your Trino cluster, Athena workgroup, or Snowflake warehouse based on your query patterns.

Treat catalog credentials like database credentials

The client ID and secret grant read access to all tables. Share only with systems that need access, store securely, and rotate if exposed.

Transformations happen elsewhere 

Managed Data Lake Service doesn't include transformation capabilities. Handle transforms upstream before ingestion or downstream in your query engine. The lake is for landing and serving data.

The point

Building an interoperable storage layer used to require deep expertise in open table formats. You'd learn Iceberg's architecture, run catalog infrastructure, schedule maintenance jobs, and manage schema evolution manually. That's a real project with real staffing requirements.

Today, you can skip all of that.

Fivetran's Managed Data Lake Service gives you everything Iceberg provides: open formats, ACID transactions, schema evolution, time travel, and multi-engine access. But you don't touch the internals. You don't run the catalog. You don't schedule compaction. You don't write maintenance scripts.

Your data lands once in an open format on storage you control. Then you query it from Snowflake, Databricks, BigQuery, Spark, Trino, or whatever else your organization uses. Same data, multiple engines, no lock-in.

You can have an interoperable data lake without suffering brain freeze from Iceberg.

[CTA_MODULE]


