Governing your lakehouse with Fivetran Managed Data Lake Service

March 16, 2026

Andrew Madson

Principal, Developer Relations

Lakehouse governance requires observability, access control, data lifecycle management, and more. Fivetran MDLS makes it easy.

With Fivetran, ingesting data into Apache Iceberg on your data lake is the easy part. Governing it is where the rubber hits the road.

Who can access which tables? How do you keep catalog metadata consistent when multiple engines are reading from the same data? What happens when someone queries raw Parquet files instead of going through the catalog? How do you manage snapshot sprawl, orphaned files, and schema drift without making it someone's entire job?

Fivetran's Managed Data Lake Service is designed around these governance problems. It writes your data once as Parquet and maintains both Iceberg and Delta Lake metadata simultaneously, but the real value is in how it enforces consistency, controls access, and automates the lifecycle management that most teams either do poorly or not at all.

A single source of truth for metadata

Governance falls apart when you don't have one authoritative answer to "what does this table look like right now?" That's the first problem the Managed Data Lake Service solves.

Every destination gets its own dedicated Fivetran Iceberg REST Catalog, built on Apache Polaris and implementing the Iceberg OpenAPI Specification. Any engine that speaks the Iceberg REST protocol can connect to it.

Here's the governance model in one sentence: only Fivetran writes metadata, and everyone else reads. The catalog is read-only from the perspective of every query engine. Fivetran updates it during every sync, and that's it. No one else touches it.

This matters because it eliminates an entire class of governance headaches. There's no scenario where Spark updates a table's schema while Trino reads a stale version. There's no risk of a manual Glue edit breaking downstream queries. Fivetran owns the metadata and publishes it consistently. If an external catalog like Glue drifts out of sync, Fivetran detects it during the next sync and automatically republishes the correct table version from its REST catalog.

The docs are blunt about this: don't modify tables manually or through any external catalog. Doing so can cause data integrity issues, and Fivetran will overwrite your changes anyway since it treats its own catalog as the single source of truth.

Access control: Layers that work together

Good governance means controlling who can read what, with the minimum permissions necessary. The Managed Data Lake Service stacks several access control layers together to accomplish this.

OAuth2 for catalog access

Query engines authenticate to the Fivetran Iceberg REST Catalog using OAuth2 client credentials. Each destination gets one Client ID/Secret pair, managed through the Fivetran dashboard under Catalog integration > Base configuration.

The connection details:

Catalog URI: https://polaris.fivetran.com/api/catalog/

OAuth2 Token URI: https://polaris.fivetran.com/api/catalog/v1/oauth/tokens

OAuth scope: PRINCIPAL_ROLE:ALL

Credential format: clientId:clientSecret

Warehouse parameter: your Fivetran group/destination ID
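As a rough illustration, the settings above map onto a client configuration something like the following. This is a minimal sketch that assumes PyIceberg's REST catalog property names (uri, credential, scope, warehouse, oauth2-server-uri). The client ID, secret, and destination ID are placeholders, and other engines use their own configuration keys.

```python
CATALOG_URI = "https://polaris.fivetran.com/api/catalog/"
TOKEN_URI = "https://polaris.fivetran.com/api/catalog/v1/oauth/tokens"


def catalog_properties(client_id: str, client_secret: str, warehouse: str) -> dict[str, str]:
    """Build REST catalog properties from the connection details above.

    The key names are an assumption based on PyIceberg's REST catalog settings.
    """
    return {
        "type": "rest",
        "uri": CATALOG_URI,
        "oauth2-server-uri": TOKEN_URI,
        "credential": f"{client_id}:{client_secret}",  # clientId:clientSecret
        "scope": "PRINCIPAL_ROLE:ALL",
        "warehouse": warehouse,  # your Fivetran group/destination ID
    }


# Placeholder values; the real ones come from the Fivetran dashboard.
props = catalog_properties("my-client-id", "my-client-secret", "my_destination_id")
print(sorted(props))
```

With PyIceberg installed, these properties would be passed along the lines of load_catalog("fivetran", **props). Load the secret from a secrets manager rather than hard-coding it.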

From a governance standpoint, treat these credentials like you'd treat any privileged service account. Rotate them if a team member with access leaves, and track where they're deployed.

Vended credentials: Scoped, short-lived S3 access

This is one of the stronger governance features. When a query engine requests data, the catalog issues short-lived storage credentials scoped to the exact S3 directory containing that table's files. Your engine never needs broad s3:GetObject permissions across the whole bucket.

You enable this with the access delegation header:

X-Iceberg-Access-Delegation: vended-credentials

The practical impact is significant. Instead of granting your Spark cluster read access to your entire data lake prefix, the catalog provides temporary credentials for just the files it needs, for that request. That's least-privilege access at the storage layer, enforced automatically.
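To make the header concrete, here is a hedged sketch of a table-load request built with Python's standard library. The path layout follows the Iceberg REST specification's /v1/{prefix}/namespaces/{namespace}/tables/{table} route. The prefix, namespace, table name, and bearer token are hypothetical placeholders, and in practice your engine or client library sends this request for you.

```python
from urllib.parse import quote
from urllib.request import Request

CATALOG_URI = "https://polaris.fivetran.com/api/catalog/"


def load_table_request(prefix: str, namespace: str, table: str, token: str) -> Request:
    """Build (but do not send) a load-table request that asks for vended credentials."""
    path = "v1/{}/namespaces/{}/tables/{}".format(
        quote(prefix, safe=""), quote(namespace, safe=""), quote(table, safe="")
    )
    return Request(
        CATALOG_URI + path,
        headers={
            "Authorization": f"Bearer {token}",
            # Ask the catalog for short-lived, table-scoped storage credentials.
            "X-Iceberg-Access-Delegation": "vended-credentials",
        },
        method="GET",
    )


# All four arguments are placeholders for illustration only.
req = load_table_request("my_destination_id", "salesforce", "account", "<access-token>")
print(req.full_url)
```

The request is only constructed here, never sent, so the sketch runs without network access or real credentials.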

IAM roles: Cross-account trust with Fivetran

Fivetran uses cross-account IAM role assumption to access your S3 bucket. Your role trusts Fivetran's AWS account, conditioned on an auto-generated, unique, and persistent External ID.

The S3 IAM policy is designed to follow least-privilege principles.
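The trust relationship in this section follows a standard AWS pattern, which can be sketched as below. This is a generic example, not the exact policy from Fivetran's setup guide: the account ID and External ID are placeholders, and the principal Fivetran asks you to trust may be more specific than an account root.

```python
import json


def trust_policy(fivetran_account_id: str, external_id: str) -> dict:
    """Return an IAM trust policy that lets one AWS account assume the role,
    but only when the caller supplies the expected External ID."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{fivetran_account_id}:root"},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Both values are placeholders; copy the real ones from the Fivetran setup form.
policy = trust_policy("111122223333", "my-external-id")
print(json.dumps(policy, indent=2))
```

The External ID condition is what stops another party from tricking Fivetran's account into assuming your role on their behalf.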

KMS encryption

If your bucket uses SSE-KMS, add a Deny statement to the IAM policy so that Fivetran can't write data with the wrong encryption key. This matters for compliance regimes that require specific key management.
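A Deny statement of this kind commonly looks like the sketch below. It uses the standard S3 condition key for the SSE-KMS key ID. Treat it as an illustration of the general AWS pattern rather than the exact statement Fivetran documents: the bucket, prefix, and key ARN are placeholders.

```python
import json


def deny_wrong_kms_key(bucket: str, prefix: str, kms_key_arn: str) -> dict:
    """Return a Deny statement that rejects uploads not encrypted with kms_key_arn."""
    return {
        "Sid": "DenyWrongKmsKey",
        "Effect": "Deny",
        "Action": "s3:PutObject",
        "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
        "Condition": {
            "StringNotEquals": {
                "s3:x-amz-server-side-encryption-aws-kms-key-id": kms_key_arn
            }
        },
    }


# Placeholder bucket, prefix, and key ARN for illustration only.
statement = deny_wrong_kms_key(
    "my-lake-bucket",
    "fivetran",
    "arn:aws:kms:us-east-1:111122223333:key/00000000-0000-0000-0000-000000000000",
)
print(json.dumps(statement, indent=2))
```

Because an explicit Deny overrides any Allow in IAM, this statement holds even if a broader write permission is granted elsewhere in the policy.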

Lake Formation

If Lake Formation is enabled for your S3 bucket, you need to configure it. Navigate to Lake Formation console > Permissions > Data locations > Grant, select the IAM role, and enter the S3 bucket prefix as the storage location. This adds fine-grained data location permissions on top of your IAM policies. If Lake Formation isn't enabled, skip this entirely.

PrivateLink: Network-level governance

AWS PrivateLink is available on the Business Critical plan. It routes traffic through your VPC without touching the public internet. For regulated industries or any environment where network isolation is a compliance requirement, this is the path to take.

Governing multi-engine access

A lakehouse that only works with one query engine isn't much of a lakehouse. The governance challenge is enabling multiple engines to read the same tables while maintaining metadata consistency and access control. The Managed Data Lake Service handles this through two complementary catalog paths.

The REST Catalog path

Engines that support the Iceberg REST protocol connect directly to the Fivetran Iceberg REST Catalog using the OAuth2 credentials.

The Glue Catalog path

Athena and Redshift don't speak the Iceberg REST protocol. They access Fivetran-managed Iceberg tables through AWS Glue as an intermediary.

The Update AWS Glue Catalog toggle must be set during destination setup. This is a one-way door; you can't change it after saving. If your governance model requires Athena or Redshift access, enable this from the start.

Constraints to plan around: the Glue catalog must be in the same AWS Region as your S3 bucket, and all Fivetran groups in a given Region share the same Glue database. AWS Glue only supports one table per schema/table name combination per Region, so you need distinct schema names for each data lake to avoid naming collisions.
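A naming check like the one below can catch collisions before you create a destination. It is a small sketch under the assumption that you keep an inventory of data lakes with their Region and schema names; the inventory structure and lake names are hypothetical.

```python
from collections import defaultdict


def find_glue_collisions(lakes: dict[str, dict]) -> dict[tuple[str, str], list[str]]:
    """Map (region, schema) to the data lakes that share it, when more than one does.

    `lakes` maps a data lake name to {"region": ..., "schemas": [...]}.
    """
    owners = defaultdict(list)
    for lake, cfg in lakes.items():
        for schema in cfg["schemas"]:
            owners[(cfg["region"], schema)].append(lake)
    return {key: names for key, names in owners.items() if len(names) > 1}


# Hypothetical inventory: two lakes in us-east-1 both use a "stripe" schema.
lakes = {
    "lake_a": {"region": "us-east-1", "schemas": ["salesforce", "stripe"]},
    "lake_b": {"region": "us-east-1", "schemas": ["salesforce_eu", "stripe"]},
    "lake_c": {"region": "eu-west-1", "schemas": ["salesforce"]},
}
collisions = find_glue_collisions(lakes)
print(collisions)
```

The check is deliberately conservative: it flags any shared schema name within a Region, even if the table names inside the schemas happen not to overlap today.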

The governance upside: Fivetran's self-healing mechanism keeps Glue in sync with the REST catalog automatically. If Glue drifts, the next sync fixes it.
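Conceptually, self-healing is a one-way reconciliation in which the REST catalog always wins. The toy model below illustrates the idea with plain dictionaries. It is not Fivetran's implementation, and the table names and metadata paths are made up.

```python
def heal_glue(rest_catalog: dict[str, str], glue_catalog: dict[str, str]) -> list[str]:
    """Overwrite Glue entries that disagree with the REST catalog.

    Both dicts map a table name to its current metadata file location.
    Returns the names of the tables that were republished.
    """
    republished = []
    for table, location in rest_catalog.items():
        if glue_catalog.get(table) != location:
            glue_catalog[table] = location  # the REST catalog is the source of truth
            republished.append(table)
    return republished


rest = {
    "sales.orders": "s3://bucket/orders/metadata/v7.json",
    "sales.users": "s3://bucket/users/metadata/v3.json",
}
glue = {
    "sales.orders": "s3://bucket/orders/metadata/v5.json",  # drifted
    "sales.users": "s3://bucket/users/metadata/v3.json",
}
fixed = heal_glue(rest, glue)
print(fixed)
```

Note that changes never flow the other way, which is why manual edits made in Glue do not survive the next sync.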

Data lifecycle governance

Ungoverned data lakes turn into data swamps. Storage costs climb, orphan files accumulate, and nobody knows which snapshots are safe to delete. The Managed Data Lake Service automates these lifecycle concerns.

Snapshot retention

Snapshot retention is configurable through the Snapshot Retention Period dropdown in the destination setup form. There's also a Retain All Snapshots option for audit or compliance requirements that mandate full history.

A daily cleanup process identifies snapshots older than the retention period, deletes them, and removes any files that are no longer referenced by the remaining snapshots. Make sure your S3 Lifecycle configuration doesn't delete files before the retention period expires, or you'll end up with snapshots pointing at files that no longer exist.
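The cleanup logic described above can be sketched as follows. This is a simplified model, not Fivetran's code: snapshots are plain dictionaries, and keeping the most recent snapshot regardless of age is an assumption borrowed from how Iceberg snapshot expiration generally behaves.

```python
from datetime import datetime, timedelta, timezone


def expire_snapshots(snapshots: list[dict], retention: timedelta, now: datetime):
    """Split snapshots into kept and expired, and list files only expired ones use.

    Each snapshot is {"id": ..., "committed_at": datetime, "files": set of paths}.
    The most recent snapshot is always kept so the table stays readable.
    """
    cutoff = now - retention
    latest_id = max(snapshots, key=lambda s: s["committed_at"])["id"]
    kept = [s for s in snapshots if s["committed_at"] >= cutoff or s["id"] == latest_id]
    kept_ids = {s["id"] for s in kept}
    expired = [s for s in snapshots if s["id"] not in kept_ids]
    live_files = set().union(*(s["files"] for s in kept))
    expired_files = set().union(*(s["files"] for s in expired))
    # A file is deletable only if no remaining snapshot still references it.
    return kept, expired, sorted(expired_files - live_files)


now = datetime(2026, 3, 16, tzinfo=timezone.utc)
snapshots = [
    {"id": 1, "committed_at": now - timedelta(days=10), "files": {"a.parquet", "b.parquet"}},
    {"id": 2, "committed_at": now - timedelta(days=3), "files": {"b.parquet", "c.parquet"}},
    {"id": 3, "committed_at": now - timedelta(days=1), "files": {"c.parquet", "d.parquet"}},
]
kept, expired, deletable = expire_snapshots(snapshots, timedelta(days=7), now)
print(deletable)
```

In this example b.parquet survives even though snapshot 1 expired, because snapshot 2 still references it. An S3 Lifecycle rule shorter than the retention period would delete it anyway, which is exactly the failure mode to avoid.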

Metadata file management

Fivetran retains the current version plus 3 previous versions of metadata files and deletes anything older. The docs recommend not deleting metadata files yourself because doing so can corrupt Iceberg tables. If your compliance team asks about metadata retention, this is the answer: current plus three, managed automatically.

Orphan file cleanup

Orphan file cleanup runs every other Saturday. Orphan files are leftovers from unsuccessful pipeline operations that aren't referenced in any table metadata. Without automated cleanup, these quietly inflate your storage bill.
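In essence, orphan detection is a set difference between what is in storage and what the table metadata references. The sketch below adds a minimum-age guard so files from in-flight writes are not mistaken for orphans. That guard and its three-day default are assumptions for illustration, not documented Fivetran behavior.

```python
from datetime import datetime, timedelta, timezone


def find_orphans(
    stored: dict[str, datetime],
    referenced: set[str],
    now: datetime,
    min_age: timedelta = timedelta(days=3),
) -> list[str]:
    """Return stored paths that no table metadata references and that are old
    enough not to belong to a write still in progress.

    `stored` maps each object path to its last-modified time.
    """
    cutoff = now - min_age
    return sorted(
        path
        for path, modified in stored.items()
        if path not in referenced and modified <= cutoff
    )


now = datetime(2026, 3, 16, tzinfo=timezone.utc)
stored = {
    "data/a.parquet": now - timedelta(days=10),        # referenced by the table
    "data/tmp-123.parquet": now - timedelta(days=8),   # leftover from a failed sync
    "data/new.parquet": now - timedelta(hours=1),      # possibly still being committed
}
orphans = find_orphans(stored, referenced={"data/a.parquet"}, now=now)
print(orphans)
```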

Compaction and deduplication

Both happen automatically. Fivetran normalizes, compacts, and deduplicates data before writing it in Iceberg format. The copy-on-write methodology means complete file rewrites during updates, which keeps read performance high and ensures data consistency. You're never querying a chain of partially-applied changes.
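To illustrate the copy-on-write idea, here is a minimal sketch in which a data file is a list of rows keyed by id. The function name and row shape are hypothetical, and deletes are left out for brevity.

```python
def copy_on_write(data_file: list[dict], changes: list[dict], key: str = "id") -> list[dict]:
    """Return a brand-new file: existing rows with changes applied, one row per key.

    The input file is never mutated, so readers of the old snapshot are unaffected.
    When the same key appears more than once in `changes`, the last one wins.
    """
    merged = {row[key]: row for row in data_file}
    for row in changes:
        merged[row[key]] = row  # update or insert, which also deduplicates by key
    return [merged[k] for k in sorted(merged)]


old_file = [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]
changes = [
    {"id": 2, "status": "paid"},
    {"id": 3, "status": "new"},
    {"id": 2, "status": "shipped"},  # a later change to the same row
]
new_file = copy_on_write(old_file, changes)
print(new_file)
```

Readers see either the old file or the new one, never a mix, which is the consistency property the paragraph above describes.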

Schema governance

Schema evolution is handled automatically during syncs. Fivetran creates and maintains tables without affecting sync performance.

Beyond AWS

Fivetran supports ADLS Gen2 on Azure and GCS on Google Cloud with the same dual-format approach. The Fivetran Iceberg REST Catalog works across all three clouds as the default Iceberg catalog.

Azure gets Unity Catalog and OneLake for Delta Lake tables. GCS gets BigLake Metastore for Iceberg tables queried through BigQuery (requires the BigQuery Admin role on the Fivetran service account). Unity Catalog works across all three clouds but supports Delta Lake tables only, not Iceberg; it authenticates via Personal Access Token or OAuth 2.0 (M2M).

The governance model is consistent across clouds: Fivetran owns the metadata, the REST catalog is the source of truth, and external catalogs stay in sync through self-healing.

The governance checklist

Before you go live, make sure you've nailed these down.

Access control. IAM role with least-privilege S3 and Glue policies. KMS encryption if required. Lake Formation configured if enabled. OAuth credentials stored securely with a rotation plan. PrivateLink for network isolation if your compliance model demands it.

Catalog strategy. REST catalog for Spark, Snowflake, Trino, DuckDB, and Dremio. Glue catalog enabled if you need Athena or Redshift. Distinct schema names per data lake to avoid Glue naming collisions.

Lifecycle management. Snapshot retention period set to match your compliance and cost requirements. S3 Lifecycle policies aligned with (not shorter than) the retention period. Awareness that Fivetran retains the current metadata file plus 3 previous versions, and that orphan cleanup runs every other Saturday.

Schema governance. Awareness of the # prefix on reserved Iceberg column names. No manual table modifications through external catalogs. Downstream consumers configured to query through the catalog, never raw Parquet files.

Permanent decisions. Bucket name, prefix path, and Glue toggle finalized before saving. Naming conventions agreed on across teams.

Get these right and the Managed Data Lake Service handles the ongoing governance work: keeping catalogs consistent, cleaning up storage, evolving schemas, and enforcing that single-source-of-truth model that makes the whole thing actually governable.


Topics

Data Lakes

Governance