Getting started with Apache Polaris Catalog in Fivetran's Managed Data Lake Service

Why the catalog matters, how to configure it with popular query engines, and the essential best practices every data engineer should follow.
January 8, 2026

Apache Polaris is an open-source catalog for Apache Iceberg tables that implements the Iceberg REST API specification. 

At its core, Polaris is a metadata catalog service. It tracks the current metadata pointer and the locations of metadata files representing Iceberg tables, enabling query engines to locate and access tables through a standardized REST interface. This server-side approach differs fundamentally from legacy catalogs like Hive Metastore, which embed technology-specific code in clients.

How Fivetran uses Polaris

Fivetran's Managed Data Lake Service uses Apache Polaris to provide a streamlined REST-based Iceberg catalog.

[CTA_MODULE]

Fivetran's Managed Data Lake Service writes data as Parquet files while maintaining Iceberg table metadata. The service supports Amazon S3, Azure Data Lake Storage (ADLS Gen2), Google Cloud Storage (GCS), and Microsoft OneLake as destinations.

The Fivetran Iceberg REST Catalog is built on Apache Polaris and serves as the default catalog for all Iceberg tables. Each data lake destination has its own dedicated catalog instance, automatically configured based on your destination's details.

Only Fivetran can write or update catalog metadata. This approach ensures data integrity and prevents conflicts from external modifications.

Getting started: Obtaining your credentials

From the Fivetran dashboard, navigate to Catalog integration > Base configuration to obtain four essential values:

| Value | Description |
| --- | --- |
| Polaris server endpoint | The REST catalog URL |
| Polaris catalog name | The catalog identifier |
| Client ID | OAuth client identifier |
| Client secret | OAuth client secret (displayed only once) |

Important: Copy your client secret immediately, because it's shown only once. You can regenerate it via the dashboard, but you'll need to update all your engine configurations.

Always query through the catalog

This is the most important concept to remember: always query Iceberg tables through the catalog, never directly against underlying Parquet files.

Once your catalog is configured, treat Fivetran-managed tables like any other logical tables in a database or warehouse. 

Why you shouldn’t query raw files

Querying Parquet files directly creates several issues that Iceberg's metadata layer solves:

File discovery overhead 

With raw Parquet folders, query engines must check each file's metadata before knowing whether to skip it. This requires expensive list and open operations across potentially thousands of files. Iceberg's two-layer metadata index (manifest list → manifests → data files) enables constant-time remote calls for scan planning regardless of table size.
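You can inspect that index yourself: Iceberg exposes it as read-only metadata tables. A minimal sketch in Spark SQL, assuming a catalog registered as polaris and a hypothetical analytics.orders table:

-- Hypothetical names: catalog `polaris`, table `analytics.orders`
-- One row per manifest in the current snapshot's manifest list
SELECT path, added_data_files_count, existing_data_files_count
FROM polaris.analytics.orders.manifests;

-- One row per data file, with the stats the planner uses for pruning
SELECT file_path, record_count, file_size_in_bytes
FROM polaris.analytics.orders.files;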

No ACID guarantees 

Raw Parquet provides no transaction isolation. Readers can see partial writes, concurrent writes can corrupt data, and there's no atomic commit across multiple files. Iceberg ensures serializable isolation: reads use committed snapshots, writes add files atomically, and concurrent writers use optimistic concurrency with automatic retry.

Schema evolution limitations 

Renaming or dropping columns in raw Parquet often requires rewriting entire files. Iceberg tracks columns by unique IDs rather than names or positions, enabling add, drop, rename, reorder, and type promotion operations without rewriting underlying data.

The benefits of using the catalog

By querying through the catalog, you avoid these issues and instead gain:

Snapshot isolation and transactional consistency 

Iceberg tracks which files belong to a committed snapshot. Through the catalog, you always read a consistent, committed view of the table. Direct file reads can mix files from different write batches or miss logically deleted rows.
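You can see exactly which committed snapshots your reads resolve to. A sketch with the same assumed names (polaris catalog, hypothetical analytics.orders table):

-- Each row is a committed snapshot; catalog reads always resolve to one of these
SELECT snapshot_id, committed_at, operation
FROM polaris.analytics.orders.snapshots
ORDER BY committed_at DESC;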

Schema evolution that doesn't break queries 

Fivetran can evolve schemas over time, for example by adding or renaming columns. When you query by table name through the catalog, the engine uses the table's metadata to map physical columns to logical columns. File-path queries lose that mapping.

File layout independence 

Fivetran can reorganize data to improve performance. Catalog-based queries only care about logical fields. There's no need to hard-code paths like date=2025-01-01 in your SQL.

Delete handling and data corrections 

Iceberg uses metadata to represent deletes and updates. Catalog readers understand delete files and apply them at read time. Raw Parquet queries return rows that were logically deleted.
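Delete files show up in the same metadata index. A hedged Spark SQL sketch with the same hypothetical names, using the content column to distinguish file kinds:

-- content: 0 = data files, 1 = position deletes, 2 = equality deletes
SELECT content, count(*) AS file_count
FROM polaris.analytics.orders.files
GROUP BY content;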

Engine interoperability 

Polaris provides a single authoritative definition of each table that multiple engines can share. Whether your team uses Trino, Spark, or Snowflake external tables, everyone sees the same table definition and snapshots.

The right way vs. the wrong way

Don’t query files directly:

-- Don't do this
SELECT *
FROM read_parquet('s3://my-bucket/fivetran/destination/orders/date=2025-01-01/*.parquet');

Do query the catalog:

-- Do this
SELECT *
FROM polaris_catalog.analytics.orders
WHERE order_date >= DATE '2025-01-01';

The query engine handles the rest: discovering the current snapshot, getting the list of data and delete files, and planning an efficient scan using column pruning.

Rule of thumb:

If your SQL contains a raw cloud storage path for Fivetran-managed data, replace it with catalog.schema.table names. Let the catalog handle the metadata for you.

Query engine configuration

The following sections provide configuration details for popular query engines. All examples use OAuth2 authentication with the credentials from your Fivetran dashboard.

Apache Spark

Requirements:

Apache Iceberg 1.9.0+ with Spark 3.4 or 3.5. Include the iceberg-aws-bundle package for S3 storage access.

spark.sql.catalog.<catalog_name> = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<catalog_name>.type = rest
spark.sql.catalog.<catalog_name>.uri = <polaris_server_endpoint>
spark.sql.catalog.<catalog_name>.warehouse = <catalog_name>

# OAuth2 authentication
spark.sql.catalog.<catalog_name>.credential = <client_id>:<client_secret>
spark.sql.catalog.<catalog_name>.scope = PRINCIPAL_ROLE:ALL
spark.sql.catalog.<catalog_name>.token-refresh-enabled = true
spark.sql.catalog.<catalog_name>.oauth2-server-uri = <polaris_server_endpoint>/v1/oauth/tokens

# Enable credential vending for storage access
spark.sql.catalog.<catalog_name>.header.X-Iceberg-Access-Delegation = vended-credentials

# Required extension for Iceberg procedures
spark.sql.extensions = org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Metadata refresh occurs automatically through the catalog when accessing a table. Use REFRESH TABLE <table> for manual refresh.
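Putting it together, a sketch of typical Spark SQL usage, assuming the catalog was registered as polaris and Fivetran created an analytics schema (table name and timestamp are illustrative):

USE polaris.analytics;

SELECT * FROM orders LIMIT 10;

-- Iceberg time travel via Spark SQL
SELECT * FROM orders TIMESTAMP AS OF '2025-01-01 00:00:00';

REFRESH TABLE orders;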

DuckDB (experimental)

Requirements: DuckDB 1.2.1+. REST catalog integration is a preview feature as of March 2025.

-- Install required extensions
INSTALL httpfs; LOAD httpfs;
INSTALL iceberg; LOAD iceberg;

-- Create OAuth secret
CREATE SECRET iceberg_secret (
    TYPE iceberg,
    CLIENT_ID '<client_id>',
    CLIENT_SECRET '<client_secret>',
    OAUTH2_SERVER_URI '<polaris_server_endpoint>/v1/oauth/tokens'
);

-- Attach catalog
ATTACH '<warehouse_name>' AS polaris_catalog (
    TYPE iceberg,
    ENDPOINT '<polaris_server_endpoint>',
    SECRET iceberg_secret
);

Supported operations: SELECT with time-travel.

Storage: Currently limited to S3 and S3-compatible storage. GCS and ADLS support is pending.

Time-travel queries use AT (VERSION => SNAPSHOT_ID) or AT (TIMESTAMP => TIMESTAMP '...') syntax.
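For example, a hedged sketch against the attachment above, with a hypothetical analytics.orders table:

-- Assumes the ATTACH above; table and timestamp are illustrative
SELECT * FROM polaris_catalog.analytics.orders LIMIT 10;

-- Read the table as of a point in time
SELECT *
FROM polaris_catalog.analytics.orders
AT (TIMESTAMP => TIMESTAMP '2025-01-01 00:00:00');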

Snowflake

Status: REST catalog integration for external Iceberg tables is generally available for reads (since June 2024). Write support is in preview (July 2025).

CREATE CATALOG INTEGRATION fivetran_catalog_int
    CATALOG_SOURCE = ICEBERG_REST
    TABLE_FORMAT = ICEBERG
    CATALOG_NAMESPACE = '<namespace>'
    REST_CONFIG = (
        CATALOG_URI = '<polaris_server_endpoint>'
        CATALOG_NAME = '<catalog_name>'
    )
    REST_AUTHENTICATION = (
        TYPE = OAUTH
        OAUTH_TOKEN_URI = '<polaris_server_endpoint>/v1/oauth/tokens'
        OAUTH_CLIENT_ID = '<client_id>'
        OAUTH_CLIENT_SECRET = '<client_secret>'
        OAUTH_ALLOWED_SCOPES = ('<scope>')
    )
    ENABLED = TRUE
    REFRESH_INTERVAL_SECONDS = 300;

We recommend setting AUTO_REFRESH = TRUE when creating Iceberg tables. The default refresh interval is 300 seconds (5 minutes). The REFRESH_INTERVAL_SECONDS parameter accepts values from 30 to 86400.

For manual refresh: ALTER ICEBERG TABLE <table> REFRESH '<metadata_file_path>'.
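To pick up new Fivetran commits automatically, create your externally managed Iceberg tables with AUTO_REFRESH enabled. A sketch, with <external_volume> and the table names as placeholders rather than values from Fivetran's docs:

-- Placeholders throughout; adjust to your external volume and table names
CREATE ICEBERG TABLE orders
    EXTERNAL_VOLUME = '<external_volume>'
    CATALOG = 'fivetran_catalog_int'
    CATALOG_TABLE_NAME = 'orders'
    AUTO_REFRESH = TRUE;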

Trino and Starburst

Create a catalog properties file (e.g., etc/catalog/polaris.properties):

connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=<polaris_server_endpoint>
iceberg.rest-catalog.warehouse=<catalog_name>

# OAuth2 configuration
iceberg.rest-catalog.security=OAUTH2
iceberg.rest-catalog.oauth2.credential=<client_id>:<client_secret>
iceberg.rest-catalog.oauth2.scope=PRINCIPAL_ROLE:ALL
iceberg.rest-catalog.vended-credentials-enabled=true

# For S3 storage
fs.native-s3.enabled=true
s3.region=<aws_region>
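With the properties file in place, Trino exposes the catalog under the file name (polaris here), and usage is plain SQL; the schema and table names below are illustrative:

SHOW SCHEMAS FROM polaris;

SELECT * FROM polaris.analytics.orders LIMIT 10;

-- Trino also exposes Iceberg metadata tables, e.g. snapshots
SELECT * FROM polaris.analytics."orders$snapshots";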

Query engine compatibility by cloud provider

| Query engine | AWS S3 | Azure ADLS | GCS |
| --- | --- | --- | --- |
| Apache Spark | ✓ | ✓ | ✓ |
| DuckDB | ✓ | – | – |
| Snowflake | ✓ | ✓ | ✓ |
| Starburst Galaxy | ✓ | ✓ | ✓ |
| Databricks | ✓ | ✓ | ✓ |
| Dremio | ✓ | ✓ | ✓ |
| BigQuery | – | – | ✓ |
| Redshift | ✓ | – | – |
| Azure Synapse | – | ✓ | – |

Best practices summary

  1. Always query through the catalog. Never query underlying Parquet files directly. Use catalog.schema.table names in all your SQL.
  2. Protect metadata files. Never modify metadata files in storage directly; no recovery mechanism exists for external modifications. Fivetran maintains exclusive write authority for good reason.
  3. Understand DuckDB limitations. REST catalog support is experimental with limited DML operations. Plan accordingly if DuckDB is your primary engine.
  4. Configure metadata refresh appropriately. Different engines have different refresh mechanisms. Match your refresh intervals to your data freshness requirements.
  5. Balance snapshot retention. Retention policies should balance storage costs against time-travel requirements. More snapshots mean more flexibility but higher storage costs.

What's next

Apache Polaris is expected to graduate from the Apache Incubator by late 2025. DuckDB REST catalog support is expected to reach general availability, and Snowflake write support is currently in active preview.

Welcome to Apache Polaris!

[CTA_MODULE]
