What is a data lake?

July 5, 2024

Charles Wang

Lead Product Evangelist

Data lakes serve as affordable, scalable repositories for all forms of data and play a central role in analytics.

Data lakes, like data warehouses, serve as central repositories for business data that users can analyze to guide business decisions. Data warehouses are predicated on the assumption that important enterprise data is structured. Structured data follows predictable formats, is easily interpreted by a machine and can be stored in a relational database.

Data lakes, by contrast, are object or file stores that can easily accommodate large volumes of both raw, unstructured data, such as free-form text, images, videos and other media, and structured, relational data, such as tables organized into schemas.

Data lakes date from around the same time as the phrase “big data,” first seeing adoption in the early 2010s, when data professionals determined that relational stores were not flexible enough to support complicated analytics and data science use cases, especially those that depend on huge volumes of data, unstructured data or streaming.

How to use a data lake

Historically, many data teams used data lakes to comprehensively store huge volumes of data before modeling and loading it into a data warehouse, using the data lake as an ELT staging area. Data teams also used data lakes as a specialized destination for applications that depended on unstructured or streaming data. At companies with more varied and complex data use cases, this sometimes led (and still leads) to separate data architectures for analytics and operational uses.

This calculus has changed with the emergence of open table formats, such as Delta Lake and Apache Iceberg™, and integration with data catalogs, such as Unity Catalog and Purview. Governed data lakes, also called data lakehouses, combine the scalability and flexibility of data lakes with the analytics readiness of data warehouses. With ACID (atomicity, consistency, isolation, durability) transactions and schema enforcement, it is increasingly practical to use data lakes as a single source of truth for all analytical and operational purposes. Rather than using a data lake as a staging area for a data warehouse, data teams can use a medallion architecture (bronze, silver and gold) within the data lake to prepare data for analytics. At the same time, the data lake can accommodate large volumes and streaming data integration. Data lakes are also modular, decoupling compute and storage, enabling organizations to mix and match compute engines and file storage platforms as needed.
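To make the medallion idea concrete, here is a minimal sketch in plain Python. It is an illustration only: the field names (`order_id`, `region`, `amount`) and the helper functions `to_silver` and `to_gold` are hypothetical, and in-memory lists stand in for tables that a real lake would keep in an open table format.

```python
from collections import defaultdict

# Bronze: raw events exactly as they landed (hypothetical fields).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "19.99"},
    {"order_id": "2", "region": "US", "amount": "5.00"},
    {"order_id": "2", "region": "US", "amount": "5.00"},  # duplicate delivery
    {"order_id": "3", "region": "EU", "amount": "n/a"},   # malformed amount
]


def to_silver(rows):
    """Clean bronze rows: enforce types, drop malformed rows, deduplicate."""
    seen, silver = set(), []
    for row in rows:
        try:
            record = {
                "order_id": int(row["order_id"]),
                "region": row["region"],
                "amount": float(row["amount"]),
            }
        except (KeyError, ValueError):
            continue  # a real pipeline would likely quarantine rejected rows
        if record["order_id"] in seen:
            continue  # skip rows whose order_id was already accepted
        seen.add(record["order_id"])
        silver.append(record)
    return silver


def to_gold(rows):
    """Aggregate silver rows into an analytics-ready table: revenue by region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

The bronze layer is never modified, so the silver and gold layers can be rebuilt from it if the cleaning rules change. The type checks in `to_silver` are a loose stand-in for the schema enforcement that table formats provide; they are not how any particular format implements it.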

How not to use a data lake

The most serious potential pitfall of using a data lake is “murkiness.” Without robust data governance, data lakes can easily turn into data swamps where unusable data is dumped and mixed with valuable data, making the platform difficult to search or navigate.

Moreover, without the use of open table formats, records in data lakes can’t easily be accessed using SQL or most business intelligence platforms, making data lakes generally unsuited for business intelligence or reporting.

If your organization’s analytics use cases depend entirely on modest volumes of relational data and you don’t foresee streaming use cases or generative AI applications for your organization’s data, it may make more sense to use a data warehouse (for now!). Otherwise, a governed data lake or data lakehouse may be a better choice.

Data lakes and the shift to Open Data Infrastructure

The use of open table formats, like Apache Iceberg and Delta Lake, is transforming data lakes from raw storage into governed, multi-engine platforms, positioning them as the foundation of Open Data Infrastructure. At its core, Open Data Infrastructure requires separating storage from compute, meaning data lands once in open, standards-based formats for use by any engine, whether a warehouse for BI or an ML runtime. This eliminates data duplication, cutting costs and governance complexity. The decoupled design is vital for AI agents, which require continuous, large-scale data access across multiple engines, and it allows them to be served without structural rebuilds.
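As a toy illustration of separating storage from compute, the sketch below writes a dataset once and lets two independent consumers read the same stored copy. Everything here is an assumption made for the example: a CSV file in a temporary directory stands in for an open table format on object storage, and `bi_engine` and `ml_engine` are hypothetical functions standing in for real compute engines.

```python
import csv
import tempfile
from pathlib import Path

# "Storage": one file in a temporary directory, standing in for an object store.
lake_dir = Path(tempfile.mkdtemp())
table_path = lake_dir / "orders.csv"

# The data lands once.
with table_path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["region", "amount"])
    writer.writeheader()
    writer.writerows([
        {"region": "EU", "amount": 20},
        {"region": "US", "amount": 5},
        {"region": "EU", "amount": 10},
    ])


def read_table(path):
    """Shared reader: both consumers parse the same stored file."""
    with path.open(newline="") as f:
        return [
            {"region": r["region"], "amount": float(r["amount"])}
            for r in csv.DictReader(f)
        ]


def bi_engine(path):
    """Reporting-style consumer: total amount per region."""
    totals = {}
    for row in read_table(path):
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


def ml_engine(path):
    """ML-style consumer: mean-centered amounts as a simple feature vector."""
    amounts = [row["amount"] for row in read_table(path)]
    mean = sum(amounts) / len(amounts)
    return [a - mean for a in amounts]


report = bi_engine(table_path)
features = ml_engine(table_path)
print(report)
print(features)
```

Neither consumer writes its own copy of the data, so adding a third one would mean adding a function rather than another pipeline. Real engines would also rely on the table format's metadata for snapshots and schema, which this sketch leaves out.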

Governed data lakes have a bright future

Regardless of whether the governed data lake is the best solution for your data needs right now, it is worth keeping up with the technology as it continues to evolve. Although we have historically been skeptical of the data lake for general analytics use cases, Fivetran now enthusiastically offers a Managed Data Lake Service. There are several reasons that we believe the future of the governed data lake is bright:

Data needs continue to grow in scale, volume and complexity, making the cost advantages and flexibility offered by data lakes more meaningful.
Data lakes are increasingly capable of performing the same functions as data warehouses, with support for cataloging, governance and tabular data.
Emerging use cases, such as those enabled by generative AI, make it ever more valuable to have all of an organization’s data in one place.

As a single source of truth, data lakes will continue to form the keystone of the modern data stack, a suite of tools and technologies used to make data from disparate sources available on a single platform.

To experience the power of Fivetran Managed Data Lake Service for yourself, sign up for a demo or a trial.

Apache Iceberg is a trademark of the Apache Software Foundation.
