What is a data lake?

Data lakes serve as affordable, scalable repositories for all forms of data and play a central role in analytics.
July 5, 2024

Data lakes, like data warehouses, serve as central repositories for business data that users can analyze to guide business decisions. Data warehouses are predicated on the assumption that important enterprise data is structured. Structured data follows predictable formats, is easily interpreted by a machine and can be stored in a relational database. 

Data lakes, by contrast, are object or file stores that can easily accommodate large volumes of both raw, unstructured data, such as free-form text, images, videos and other media, and structured, relational data, such as tables organized into schemas.

Data lakes date from around the same time as the phrase “big data.” They first saw adoption in the early 2010s, when data professionals determined that relational stores were not flexible enough to support complicated analytics and data science use cases, especially those that depend on huge volumes of data, unstructured data or streaming data.

How to use a data lake

Historically, many data teams used the data lake as an ELT staging area, comprehensively storing huge volumes of data there before modeling it and loading it into a data warehouse. Data teams also used data lakes as a specialized destination for applications that depended on unstructured or streaming data. At companies with more varied and complex data use cases, this sometimes led (and still leads) to separate data architectures for analytics and operational uses.

This calculus has changed with the emergence of open table formats, such as Delta Lake and Iceberg, and integration with data catalogs, such as Unity Catalog and Purview. Governed data lakes, also called data lakehouses, combine the scalability and flexibility of data lakes with the analytics readiness of data warehouses. With ACID (atomicity, consistency, isolation, durability) transactions and schema enforcement, it is increasingly practical to use data lakes as a single source of truth for all analytical and operational purposes.

Rather than using a data lake as a staging area for a data warehouse, data teams can use a medallion architecture (bronze, silver and gold) within the data lake to prepare data for analytics. At the same time, the data lake can accommodate large data volumes and streaming data integration. Data lakes are also modular: because they decouple compute from storage, organizations can mix and match compute engines and file storage platforms as needed.
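To make the medallion idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not how Delta Lake, Iceberg or any particular platform implements these layers. The sample records, the `Order` schema and the function names `to_silver` and `to_gold` are all hypothetical. Bronze holds raw records as they arrived, silver keeps only records that pass schema enforcement, and gold holds an aggregate ready for analytics.

```python
from dataclasses import dataclass

# Bronze: hypothetical raw records, stored exactly as they arrived.
BRONZE = [
    {"order_id": "1001", "customer": "acme", "amount": "19.99"},
    {"order_id": "1002", "customer": "globex", "amount": "5.00"},
    {"order_id": "1003", "customer": "acme", "amount": "not-a-number"},  # bad type
    {"order_id": "1004", "customer": "acme"},  # missing field
]


@dataclass(frozen=True)
class Order:
    """Hypothetical silver-layer schema: typed, complete records only."""
    order_id: int
    customer: str
    amount: float


def to_silver(bronze_rows):
    """Enforce the schema; return (valid_orders, rejected_raw_rows)."""
    valid, rejected = [], []
    for row in bronze_rows:
        try:
            valid.append(
                Order(int(row["order_id"]), str(row["customer"]), float(row["amount"]))
            )
        except (KeyError, ValueError):
            # Rows that do not fit the schema are set aside, not silently loaded.
            rejected.append(row)
    return valid, rejected


def to_gold(silver_rows):
    """Aggregate cleaned orders into revenue per customer."""
    totals = {}
    for order in silver_rows:
        totals[order.customer] = totals.get(order.customer, 0.0) + order.amount
    return totals


silver, rejected = to_silver(BRONZE)
gold = to_gold(silver)
```

In this sketch, the rejected rows remain available in bronze for later inspection, which reflects the general appeal of keeping raw data in the lake rather than discarding it during loading.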

How not to use a data lake

The most serious potential pitfall of using a data lake is “murkiness.” Without robust data governance, data lakes can easily turn into data swamps where unusable data is dumped and mixed with valuable data, making the platform difficult to search or navigate.
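One simple governance check is to compare what is stored in the lake against what the catalog describes. The sketch below is a toy illustration only: the object keys, the catalog entries and the `uncataloged` helper are hypothetical, and real catalogs such as Unity Catalog or Purview expose their own interfaces that are not shown here.

```python
# Hypothetical object keys found in a lake's storage.
LAKE_OBJECTS = [
    "sales/orders/2024-07.parquet",
    "tmp/export_final_v2.csv",
    "marketing/campaigns/2024-q2.parquet",
    "scratch/untitled.json",
]

# Hypothetical catalog: only documented, owned datasets appear here.
CATALOG = {
    "sales/orders/2024-07.parquet": {"owner": "finance"},
    "marketing/campaigns/2024-q2.parquet": {"owner": "marketing"},
}


def uncataloged(objects, catalog):
    """Return the stored objects that have no catalog entry, sorted by key."""
    return sorted(key for key in objects if key not in catalog)


swamp_candidates = uncataloged(LAKE_OBJECTS, CATALOG)
```

Objects that show up in `swamp_candidates` have no documented owner, so they are the ones most likely to make the lake harder to search and navigate.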

Moreover, without open table formats, records in data lakes can’t easily be accessed using SQL or most business intelligence platforms. A data lake that lacks them is generally unsuited for business intelligence or reporting.
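The point about SQL access can be illustrated with a small sketch. Raw records in a file or object store are just text or bytes, so they cannot be queried with SQL until something imposes a tabular schema on them. Here Python's built-in `sqlite3` module serves purely as a stand-in for that tabular layer. It is not an open table format and does not reflect how Delta Lake or Iceberg work internally. The sample records and the `load_as_table` helper are hypothetical.

```python
import json
import sqlite3

# Hypothetical raw object: newline-delimited JSON, as it might sit in a lake.
RAW_OBJECT = "\n".join([
    '{"page": "/pricing", "ms": 120}',
    '{"page": "/docs", "ms": 340}',
    '{"page": "/pricing", "ms": 80}',
])


def load_as_table(raw_text, conn):
    """Impose a declared schema on the raw records so SQL can reach them."""
    conn.execute(
        "CREATE TABLE page_views (page TEXT NOT NULL, ms INTEGER NOT NULL)"
    )
    rows = [(rec["page"], rec["ms"]) for rec in map(json.loads, raw_text.splitlines())]
    conn.executemany("INSERT INTO page_views VALUES (?, ?)", rows)


conn = sqlite3.connect(":memory:")
load_as_table(RAW_OBJECT, conn)

# Once the records are tabular, an ordinary SQL aggregate works.
avg_by_page = dict(
    conn.execute("SELECT page, AVG(ms) FROM page_views GROUP BY page ORDER BY page")
)
```

Until `load_as_table` runs, the only way to answer the same question is to parse the raw text by hand, which is roughly the situation a business intelligence tool faces with a lake that has no table format.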

If your organization’s analytics use cases depend entirely on modest volumes of relational data and you don’t foresee streaming use cases or generative AI applications for your organization’s data, it may make more sense to use a data warehouse (for now!). Otherwise, a governed data lake or data lakehouse may be a better choice.

Governed data lakes have a bright future

Regardless of whether the governed data lake is the best solution for your data needs right now, it is worth keeping up with the technology as it continues to evolve. Although we have historically been skeptical of the data lake for general analytics use cases, Fivetran now enthusiastically offers a Managed Data Lake Service. There are several reasons that we believe the future of the governed data lake is bright:

  1. Data needs continue to grow in scale, volume and complexity, making the cost advantages and flexibility offered by data lakes more meaningful.

  2. Data lakes are increasingly capable of performing the same functions as data warehouses, with support for cataloging, governance and tabular data.

  3. Emerging use cases, such as those enabled by generative AI, make it ever more valuable to have all of an organization’s data in one place.

As a single source of truth, data lakes will continue to form the keystone of the modern data stack, a suite of tools and technologies used to make data from disparate sources available on a single platform. 

To experience the power of Fivetran Managed Data Lake Service for yourself, sign up for a demo or a trial.
