The definitive guide to data lake catalogs

A data lake fails when users cannot find, trust, or govern its data. As data volumes grow, the absence of a central system of record makes analytics unreliable and blocks the development of AI models. This is the default state for any unmanaged data lake.
A data lake catalog prevents this failure. It is the metadata layer that imposes a logical structure on physical data files, making the data discoverable, secure, and ready for use.
This guide outlines the technology that makes this possible. We will analyse catalog architecture, the function of open table formats, and the leading solutions. It provides a technical framework for selecting and integrating the correct catalog for a given data stack. A catalog is the component that makes a data lake a functional asset.
Why a modern data lake requires a catalog
An unmanaged data lake is an unreliable component in a data architecture. As data is added from multiple sources, the absence of a central management layer slows development, reduces data quality, and introduces security vulnerabilities. A catalog is the mechanism that eliminates these problems by providing a single, authoritative registry for all data in the lake.
What is a data lake catalog?
A data lake catalog is a centralized repository for metadata. It maps logical database constructs like schemas, tables, and columns to the physical data files stored in object storage. The catalog does not contain the data itself; it holds the structured information that describes the data: its format, location, and schema. This makes it the primary interface for query engines, processing frameworks, and users to discover and access data programmatically.
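To make this concrete, the sketch below shows a query engine discovering data through a catalog rather than by scanning storage paths. It assumes a PySpark session already configured against a catalog (Hive Metastore, AWS Glue, or similar); the database and table names are illustrative.

```python
from pyspark.sql import SparkSession

# Assumes the session is already wired to a metastore/catalog via cluster config.
spark = SparkSession.builder.appName("catalog-discovery").getOrCreate()

# Discover tables in a database without listing any object storage paths.
for table in spark.catalog.listTables("analytics"):
    print(table.name, table.tableType)

# Inspect the schema the catalog holds for one table, then query it by name.
spark.table("analytics.orders").printSchema()
```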
The failures of an unmanaged data lake
An unmanaged data lake is inherently unreliable for production workloads. It fails to provide the necessary guarantees for data discoverability, trust, and governance.
Locating a specific dataset becomes a manual process of searching through object storage, reading undocumented scripts, or relying on the memory of individual engineers. This work does not scale and consumes a large share of a data team's time. Projects are delayed or cancelled because the required data cannot be located, queried, and verified quickly enough to be useful. Technical teams are forced to act as data librarians instead of analysts and developers.
Discovered data is unusable if it is not verifiable. Without a catalog, a user is presented with raw files and no context. They cannot answer fundamental questions:
- What is the origin of this data?
- When was it last updated?
- What transformations have been applied?
- Does it contain sensitive information?
This lack of lineage and documentation makes the data untrustworthy. As a result, teams revert to using legacy data silos, which leads to inconsistent analysis and incorrect business decisions.
Managing permissions at the file or bucket level is insufficient for enterprise security. An unmanaged data lake has no mechanism for enforcing fine-grained access controls, such as permissions at the table, row, or column level. This is a direct security vulnerability. It makes it impossible to guarantee that users only see the data they are authorised to access, which exposes the organisation to compliance violations and data breaches.
How data lake catalogs work
A data lake catalog is a logical abstraction layer that operates on top of a physical storage system like Amazon S3 or Azure Blob Storage. It decouples the logical organisation of data (schemas, tables) from the physical data files. Instead of scanning directories, a query engine like Apache Spark first queries the catalog to resolve the schema and physical location of the data it needs to read. This makes efficient and governed data management possible at scale.
The central function of metadata management
A catalog’s primary role is to serve as the authoritative source for all types of metadata. This allows data engineers and data scientists to programmatically discover and understand data assets. Catalogs categorize this metadata into three distinct types, illustrated in the sketch after this list:
- Technical metadata is the structural information required by machines. It includes table schemas (column names, data types), partition strategies, and the physical location of data files. This metadata is essential for query optimization and data processing.
- Operational metadata documents data lineage and processing history. It includes output from ETL/ELT job runs, data freshness timestamps, version history, and access logs. This information is used to debug data pipelines and audit access.
- Business metadata provides context for data consumers. It assigns business definitions to tables and columns, documents data ownership, and applies data governance tags (e.g., PII, sensitive). This connects raw data to business concepts and supports compliance with internal governance policies.
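As a rough illustration of how these three categories fit together, the Python sketch below records them for a single hypothetical table. The field names and values are illustrative, not any particular catalog's API.

```python
# Illustrative only: how the three metadata types for one table might be recorded.
table_metadata = {
    "technical": {
        "schema": [{"name": "order_id", "type": "bigint"},
                   {"name": "order_ts", "type": "timestamp"}],
        "partitioned_by": ["order_date"],
        "location": "s3://my-lake/warehouse/analytics/orders/",
        "format": "parquet",
    },
    "operational": {
        "last_updated": "2024-05-01T06:30:00Z",
        "produced_by_job": "orders_ingest_daily",
        "upstream_sources": ["postgres.public.orders"],
    },
    "business": {
        "description": "One row per customer order, deduplicated.",
        "owner": "data-platform@example.com",
        "tags": ["PII:none", "domain:sales"],
    },
}
```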
Understanding schema in the data lake
A data lake stores data in its raw format, a design that is incompatible with the rigid schema enforcement of a traditional data warehouse. A catalog is the component that enables schema management in this flexible environment. The two primary models are schema-on-write and schema-on-read.
In the schema-on-write model, users define a table’s schema before writing data. The system validates all incoming data against this schema and rejects records that do not conform. This process guarantees high data quality and optimizes query performance because the data structure is fixed and predictable. The trade-off is low flexibility and slower data ingestion, as it requires upfront data transformation.
In the schema-on-read model, users load raw data into storage without an initial schema validation. The schema is applied only when the data is queried. This model provides maximum flexibility and high-speed ingestion. The trade-off is lower query performance and the risk of inconsistent data if not managed correctly. Data catalogs make schema-on-read a viable enterprise strategy by storing and enforcing the schema definitions that query engines need to interpret the raw files correctly.
Schema-on-read vs. schema-on-write
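The contrast can be sketched in PySpark. In the example below, the raw path, schema, and table name are illustrative: the schema-on-read case supplies (or fetches from the catalog) a schema at query time, while the schema-on-write case registers a table so that non-conforming writes fail instead of landing bad records.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Schema-on-read: raw JSON was landed without validation; the schema is applied
# only when the data is read.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
])
events = spark.read.schema(event_schema).json("s3://my-lake/raw/events/")

# Schema-on-write: the table's schema is registered in the catalog up front,
# and appends that do not conform fail rather than corrupting the table.
events.write.mode("append").saveAsTable("analytics.events")
```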
The critical role of open table formats
Storing data as a collection of static files in object storage prevents reliable data modification. A failed write job can leave data in a corrupted state, and changing a single row requires rewriting entire files. Open table formats (OTFs) like Apache Iceberg, Apache Hudi, and Delta Lake solve these problems. Each is an open-source metadata layer that brings the reliability of an ACID-compliant database to the data lake.
OTFs provide three critical features:
- ACID transactions: Ensure that operations are atomic. A write job either completes successfully or it fails, preventing data corruption.
- Schema evolution: Provide safe methods for altering a table’s schema, such as adding or renaming columns, without rewriting the underlying data files.
- Time travel: Maintain a version history, allowing users to query data as it existed at a previous point in time. This is a requirement for auditing and reproducing analyses.
A data lake catalog integrates directly with the table format. The OTF manages the file-level metadata, tracking which physical files constitute a table at any given version. The catalog stores the high-level pointer to the OTF’s current root metadata file.
When a query engine requests a table, it first asks the catalog for the table's location. The catalog points the engine to the correct table format metadata, which then provides the engine with the exact list of data files to read for that version of the table. This integration is the foundation of a modern, reliable data lake.
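As an illustration of this resolution path, here is a hedged PySpark sketch of a session wired to an Apache Iceberg catalog, including a time travel read. The catalog name, warehouse path, and snapshot ID are placeholders, and the exact configuration keys depend on the catalog implementation (Hive, Glue, REST, Nessie) and on the Iceberg runtime being available to the cluster.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register an Iceberg catalog named "lake"; backing store and warehouse
    # location are placeholders.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hive")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-lake/warehouse/")
    .getOrCreate()
)

# The catalog resolves the table name to Iceberg's current root metadata file;
# Iceberg then gives the engine the exact list of data files for that version.
current = spark.table("lake.analytics.orders")

# Time travel: read the table as of an earlier snapshot tracked by the table format.
historical = (
    spark.read
    .option("snapshot-id", 5735460283071284000)  # placeholder snapshot ID
    .table("lake.analytics.orders")
)
```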
An overview of catalog types and popular options
The choice of a data lake catalog is determined by architecture, cloud ecosystem, and specific technical requirements for data governance and data management. Not all data catalogs are interchangeable. The three primary architectures are the Hive Metastore, managed cloud services, and transactional catalogs.
Common catalog architectures
The Apache Hive Metastore is the original catalog for the Hadoop ecosystem and remains a de facto open-source standard. It operates on a client-server model, storing all metadata in an external relational database like PostgreSQL or MySQL. Compute engines like Apache Spark and Trino connect to the metastore as a client to retrieve schema and location information before reading data from the data lake.
Its primary advantage is that it is open source and supported by nearly every tool in the big data ecosystem. The operational overhead, however, is substantial: data teams are responsible for provisioning, managing, and scaling the metastore service and its supporting database, including high-availability setup, backups, and performance optimization.
It is a common choice for on-premises deployments or in cloud environments where a team wants to avoid vendor lock-in. It often runs with compute clusters like Amazon EMR.
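A minimal sketch of this client-server pattern, assuming a metastore reachable at a placeholder thrift endpoint:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Point the session at the self-hosted metastore service.
    .config("hive.metastore.uris", "thrift://metastore.internal:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Table names, schemas, and file locations are resolved by the metastore,
# not by listing paths in object storage.
spark.sql("SHOW TABLES IN analytics").show()
spark.table("analytics.orders").printSchema()
```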
Service-based catalogs are fully managed, serverless offerings from cloud providers. The AWS Glue Data Catalog is the most prominent example. It provides a Hive Metastore-compatible API but eliminates all operational management. There are no servers to provision or databases to manage. It is a pay-per-use service that scales automatically.
These modern data catalogs integrate with the provider's cloud ecosystem. The AWS Glue Data Catalog, for instance, is the default catalog for services like Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum. This tight integration simplifies infrastructure and security, as authentication and authorization are managed through the cloud's native identity and access management (IAM) service. The primary trade-off for this convenience is that its features are optimized for a single cloud provider’s environment, which can create challenges for multi-cloud strategies.
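For a sense of what programmatic access looks like, the sketch below reads table metadata from the Glue Data Catalog with boto3. The database and table names are illustrative, and the call requires AWS credentials with the appropriate Glue permissions.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Fetch the catalog entry for one table: schema, format, and physical location.
table = glue.get_table(DatabaseName="analytics", Name="orders")["Table"]

print(table["StorageDescriptor"]["Location"])         # physical S3 location
for column in table["StorageDescriptor"]["Columns"]:  # technical metadata
    print(column["Name"], column["Type"])
```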
Transactional catalogs are a newer architecture designed to provide atomic, cross-table transactions with Git-like semantics. Project Nessie is the leading open-source implementation in this category. It allows data engineers to branch, merge, and tag the entire state of the catalog, treating data operations as code.
This model enables use cases that are impossible with traditional catalogs. For example, a developer can create an isolated branch to ingest and transform new data, validate its quality, and then atomically merge it into the main production branch.
This ensures that production queries are never exposed to partial or unvalidated data. It also provides a complete, versioned history of the data, which is critical for reproducibility and governance. Transactional catalogs solve a different set of problems, focusing on the reliability and atomicity of data engineering workflows rather than just schema discovery.
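A hedged sketch of that branch-validate-merge workflow, assuming a Spark session configured with an Iceberg catalog named `nessie` and the Nessie Spark SQL extensions; branch, table, and staging names are placeholders, and the exact SQL syntax can vary across Nessie versions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Nessie catalog config assumed set on the cluster

# Work on an isolated branch; production queries against `main` are unaffected.
spark.sql("CREATE BRANCH IF NOT EXISTS etl_2024_05_01 IN nessie FROM main")
spark.sql("USE REFERENCE etl_2024_05_01 IN nessie")

# Ingest and validate on the branch.
spark.sql("INSERT INTO nessie.analytics.orders SELECT * FROM staging_orders")
assert spark.table("nessie.analytics.orders").filter("order_id IS NULL").count() == 0

# Atomically publish the validated state to production.
spark.sql("MERGE BRANCH etl_2024_05_01 INTO main IN nessie")
```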
Comparison of leading data lake catalog solutions
How to choose the right data lake catalog
Selecting a catalog requires evaluating a data stack’s architecture, operational model, and data governance requirements. The evaluation must prioritize the following five criteria.
Openness and interoperability
A catalog must prevent vendor lock-in. The primary measure of openness is compatibility with the Apache Hive Metastore API, the most widely supported standard for metadata access. A compatible API ensures that compute engines like Apache Spark, Trino, and Presto can integrate with the catalog. Proprietary APIs complicate the use of open-source tools and hinder multi-cloud strategies.
Performance and scalability
The catalog must handle the metadata load of the entire data lake without creating a bottleneck. As the number of tables, partitions, and files grows, the catalog's performance becomes critical.
For a self-hosted Hive Metastore, this demands careful capacity planning for its database and service instances. A managed service like the AWS Glue Data Catalog handles scaling automatically. Still, its API rate limits must be validated against the peak demands of production ETL jobs and concurrent interactive queries.
Integration with query and processing layers
A catalog must integrate with the specific compute frameworks used by data engineers and data analysts. The integration must support engine-specific features, such as partition pruning, to ensure efficient query execution. The system must also handle concurrent requests from multiple jobs and users without creating contention.
Security and fine-grained governance
A catalog is a control plane for enforcing data governance policy. It must support fine-grained access control at the table, column, and row levels. In a cloud environment, this requires the catalog to use the native identity and access management (IAM) service for authentication and authorisation. This allows administrators to manage data and infrastructure permissions in a single system.
The operational model
The choice between a managed service and a self-hosted platform is a trade-off between operational cost and control. A managed catalog eliminates all infrastructure management, which is the correct choice for teams that need to minimize operational overhead.
A self-hosted catalog provides complete control over configuration, performance, and security, but it requires a dedicated engineering team to manage and scale the infrastructure.
Decision guide: Selecting a catalog
If you are on AWS and want a serverless, fully managed catalog that integrates with AWS services:
→ Use AWS Glue Data Catalog
If you are on Google Cloud and need a managed catalog that unifies metadata across BigQuery and your data lake:
→ Use Google Cloud Dataplex
If you want an open-source standard and can manage your own infrastructure:
→ Use Apache Hive Metastore
If you need version control, CI/CD support, and cross-table transactions:
→ Use Project Nessie
Integrating a catalog with the modern data stack
A data lake catalog functions as a service discovery layer for data, allowing disparate tools for ingestion, processing, and analytics to interoperate without direct dependencies on the physical storage layer.
The query and processing layer
Compute engines and processing frameworks use the catalog to plan and execute queries. An engine like Presto or Apache Spark connects to the catalog's API endpoint (for example, the Hive Metastore's Thrift interface or a cloud catalog's REST API). When a user submits a query for a table, the engine performs the following steps, sketched in code after the list:
- It calls the catalog to retrieve the table's schema, physical location, partition information, and a pointer to the current metadata file of its open table format (e.g., Apache Iceberg).
- Using this information, the engine develops an optimized query plan, including partition pruning to minimize the amount of data that needs to be scanned from object storage like Amazon S3.
- The engine reads the required data files directly from object storage and executes the query.
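A minimal PySpark sketch of that sequence, with illustrative names: the catalog resolves the table, and the partition filter lets the engine prune before reading from object storage.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # catalog configuration assumed

# Resolve the table through the catalog, not through a storage path.
orders = spark.table("lake.analytics.orders").filter("order_date = '2024-05-01'")

# The physical plan should show the partition filter pushed down, so only
# matching partitions are scanned.
orders.explain()
orders.groupBy("status").count().show()
```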
This architecture decouples compute from storage, allowing an organisation to use multiple specialised query engines on the same copy of the data.
The data movement platform
Automated data movement platforms populate the data lake with fresh, analysis-ready data. The platform must write data in a way that is atomic and immediately discoverable via the catalog. A modern platform achieves this by integrating at the table format layer.
When new data is ingested, the platform performs a transactional write operation. It writes the new data files to object storage and, upon successful completion, updates the open table format metadata to register a new table version. This new version is then made visible in the data lake catalog.
This ensures that a query engine will only ever see a consistent, complete state of the data, which is a core requirement for reliable data management. This tight integration is necessary to prevent the ingestion of partial data and to support workloads that depend on low-latency data, such as near real-time analytics.
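One way this can look in practice, sketched with Spark's DataFrameWriterV2 against an open table format table; names and paths are illustrative, and an Iceberg or Delta catalog is assumed to be configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stage the newly ingested files.
incoming = spark.read.parquet("s3://my-lake/staging/orders/2024-05-01/")

# A single transactional commit: either the whole append becomes a new table
# version visible through the catalog, or readers keep seeing the prior version.
incoming.writeTo("lake.analytics.orders").append()
```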
Best practices for catalog integration
A catalog's utility depends on its integration and the operational discipline of the teams that use it. The following practices ensure a catalog does not become an unmanaged or obsolete component.
Enforce the catalog as the single source of truth
All tools, applications, and users must interact with data through the catalog. The primary anti-pattern is allowing individual tools or teams to bypass the catalog and read data files directly from object storage.
This practice creates a fragmented metadata environment where different systems operate with different schemas and data definitions. It leads to inconsistent analysis and duplicated data engineering work, and it makes central data governance impossible to enforce. The catalog must be the exclusive gateway to the data lake.
Automate metadata registration
A catalog that relies on manual updates by data engineers will fail. It will inevitably become outdated, and its information will no longer be trusted. To prevent this, data movement platforms and ETL/ELT jobs must be configured to automatically register new tables, schemas, and partitions with the catalog API as a standard part of the ingestion process.
This automated registration ensures that data is discoverable the moment it becomes available for use and is a core requirement for reliable data management.
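A sketch of what that looks like when registration is part of the ingestion job itself; the table name, schema, and partitioning are illustrative, and an Iceberg-capable catalog is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the table in the catalog if it does not exist yet.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.orders (
        order_id   BIGINT,
        status     STRING,
        order_date DATE
    )
    USING iceberg
    PARTITIONED BY (order_date)
""")

# Every pipeline run writes through the catalog-registered table, never by
# dropping anonymous files into a bucket path.
incoming = spark.read.parquet("s3://my-lake/staging/orders/2024-05-01/")
incoming.writeTo("lake.analytics.orders").append()
```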
Implement governance policies within the catalog
Governance must be implemented as a feature within the catalog, not as a separate external layer, to make it programmatic, consistent, and auditable. For instance (illustrated in the sketch after this list):
- The catalog must be the control plane where security and data quality rules are defined and enforced.
- Access control policies should be linked directly to tables, views, or columns within the catalog itself.
- Data quality metrics and validation results should be stored as technical metadata alongside the data they describe.
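As one illustration of attaching column-level access control in the catalog's control plane, the sketch below uses AWS Lake Formation over the Glue Data Catalog; the principal ARN, database, table, and column names are placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Grant SELECT on specific, non-sensitive columns of a cataloged table to a role.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "analytics",
            "Name": "orders",
            "ColumnNames": ["order_id", "status", "order_date"],
        }
    },
    Permissions=["SELECT"],
)
```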
Decouple the catalog from a single compute engine
A catalog should be deployed as a standalone, central service, independent of any single query engine. The common anti-pattern is using a catalog that is tightly bundled with one specific compute framework, such as the original Apache Hive Metastore's dependency on the Hive runtime.
This creates vendor lock-in and prevents an organisation from using specialised tools for different workloads. A flexible architecture allows a team to run Apache Spark for large-scale ETL and a separate SQL engine for interactive queries, with both operating on the same data through the same central catalog.
Operationalize your data with Fivetran
A data lake catalog is a foundational requirement for building a reliable data platform. Without it, a data lake is an unmanaged collection of files that cannot provide the discoverability, trust, or data governance necessary for production workloads.
The catalog imposes a logical, structured interface on physical storage and integrates with the entire data stack from ingestion to analytics. It provides a single, authoritative source of truth for all data assets, which is the difference between a functional platform and a collection of inaccessible data silos.
A reliable data lake requires a reliable data movement platform. Fivetran automates the ingestion of analysis-ready data, writing directly to open table formats like Apache Iceberg and Delta Lake. This ensures your data is always consistent, discoverable through your catalog, and ready for production workloads.
[CTA_MODULE]