What is a data lake?

Data lakes serve as central destinations for business data and offer users a platform to guide business decisions.
February 4, 2022

Data lakes, like data warehouses and data marts, serve as central destinations for business data and offer users a platform to guide business decisions. Data warehouses and data marts are predicated on the assumption that important enterprise data is structured. Structured data follows predictable formats, is easily interpreted by a machine and can be stored in a relational database. 

Data lakes, by contrast, are object or file stores that can easily accommodate large volumes of both raw, unstructured data and structured, relational data. That may include free-form text, images, videos and other media, as well as tables neatly organized into schemas. 

Data lakes date to the early 2010s, when some data professionals determined that relational stores were not flexible enough to support complicated analytics and data science use cases, especially those that depended on unstructured data.

How to use a data lake

The simplest way to use a data lake is as a staging area: store huge volumes of raw data comprehensively, then model it and load it into a data warehouse. This approach is a pure expression of ELT (extract, load, transform). Besides supporting media files and unstructured data, its main advantage is that you don’t have to design a schema for your data beforehand.
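As a rough illustration of the staging pattern, the sketch below lands raw records in a lake without any schema and models them into a table only later. It makes several assumptions: a local temporary directory stands in for an object store, an in-memory SQLite database stands in for a data warehouse, and the names `land_raw`, `load_to_warehouse` and the `orders` records are hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path


def land_raw(lake_dir: Path, source: str, records: list[dict]) -> Path:
    """Extract and load: write records to the lake as received, with no schema."""
    target = lake_dir / "raw" / source
    target.mkdir(parents=True, exist_ok=True)
    path = target / "batch_0001.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def load_to_warehouse(raw_file: Path, conn: sqlite3.Connection) -> int:
    """Transform: model the raw records into a table only when they are needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    rows = []
    with raw_file.open(encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # Fields the model does not use (such as "note") simply stay in the lake.
            rows.append((record["id"], float(record["amount"])))
    conn.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# Hypothetical usage: a temp directory plays the lake, SQLite plays the warehouse.
lake = Path(tempfile.mkdtemp())
raw = land_raw(lake, "orders", [
    {"id": 1, "amount": "19.99", "note": "gift wrap"},
    {"id": 2, "amount": 5},
])
warehouse = sqlite3.connect(":memory:")
loaded = load_to_warehouse(raw, warehouse)
total = warehouse.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Note that the two records do not share the same fields or types, yet both land in the lake unchanged. The schema decision is deferred to the modeling step.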

The second way to use a data lake is as a specialized destination for specific artificial intelligence or machine learning applications that depend on unstructured data for training sets. Unlike a data warehouse, a data lake can store large quantities of media such as documents, images, videos and audio. These media can be organized into training and validation sets for machine learning models.
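One possible way to organize media into training and validation sets is to assign each object to a split based on a hash of its key, so the assignment stays stable as new files arrive. The sketch below is a minimal example of that idea, not a prescribed method. The object keys and the function names `assign_split` and `split_keys` are hypothetical.

```python
import hashlib


def assign_split(object_key: str, validation_fraction: float = 0.2) -> str:
    """Map an object key to a split deterministically, using a hash of the key."""
    digest = hashlib.sha256(object_key.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number between 0 and 1.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "validation" if bucket < validation_fraction else "training"


def split_keys(keys, validation_fraction: float = 0.2) -> dict:
    """Group object keys into training and validation lists."""
    splits = {"training": [], "validation": []}
    for key in keys:
        splits[assign_split(key, validation_fraction)].append(key)
    return splits


# Hypothetical object keys, as they might appear in a lake's image folder.
keys = [f"images/cats/{i:04d}.jpg" for i in range(1000)]
splits = split_keys(keys)
```

Because the split depends only on the key, rerunning the job after new media lands in the lake never moves an existing file from one set to the other.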

Data lakes are popular for both use cases, and top cloud offerings include Amazon S3 (the usual storage layer for data lakes on AWS), Google Cloud Storage and Microsoft Azure Data Lake Storage.

How not to use a data lake

A serious potential pitfall of using a data lake is “murkiness.” Without a robust approach to data governance, data lakes can easily turn into data swamps where unwanted or unused data is dumped and valuable data is difficult to search or navigate. The very lack of structure in a data lake makes it difficult to govern.

Moreover, records in data lakes can’t easily be accessed or joined using SQL or most business intelligence platforms, making data lakes generally unsuited for use by analysts.

If your organization’s analytics use cases depend wholly on relational data, a data warehouse generally makes more sense. For a deeper treatment of the subject, read our post on data lakes vs. data warehouses.

Trends to look out for

New technologies, such as AWS Lake Formation and Databricks Data Lakehouse, combine characteristics of both data warehouses and data lakes. Some data lakes now incorporate data warehouse features, such as ACID (atomicity, consistency, isolation, durability) transactions and schema enforcement, to make data less “murky.” Likewise, data warehouses now sometimes support less structured data, along with data science tools and languages usually associated with data lakes, such as Apache Spark and Python. A data repository that combines characteristics of both a data lake and a data warehouse may be referred to as a data lakehouse.
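To make schema enforcement concrete, here is a minimal sketch in plain Python. It does not use the API of any particular lakehouse product. The schema, the `SchemaMismatch` error and the `append_batch` function are hypothetical, and an in-memory list stands in for a table. The idea is that a batch is checked against an expected schema and is appended in full or not at all.

```python
# Hypothetical expected schema: column name -> required Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "currency": str}


class SchemaMismatch(ValueError):
    """Raised when a record does not match the table's expected schema."""


def validate(record: dict, schema: dict) -> None:
    """Reject records with missing columns, extra columns or wrong types."""
    missing = schema.keys() - record.keys()
    extra = record.keys() - schema.keys()
    if missing or extra:
        raise SchemaMismatch(f"missing={sorted(missing)} extra={sorted(extra)}")
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise SchemaMismatch(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )


def append_batch(table: list, batch: list, schema: dict) -> None:
    """Validate every record first, so a bad batch leaves the table untouched."""
    for record in batch:
        validate(record, schema)
    table.extend(batch)


table = []
append_batch(table, [{"order_id": 1, "amount": 19.99, "currency": "USD"}], EXPECTED_SCHEMA)

try:
    # The second record has a string where a float is expected, so the whole batch is rejected.
    append_batch(table, [
        {"order_id": 2, "amount": 5.0, "currency": "EUR"},
        {"order_id": 3, "amount": "n/a", "currency": "EUR"},
    ], EXPECTED_SCHEMA)
except SchemaMismatch as error:
    rejected_reason = str(error)
```

Validating the whole batch before writing anything is a loose, single-process analogy for the all-or-nothing behavior that atomic transactions provide. Real systems enforce this at the storage layer.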

However these technologies evolve, single sources of truth such as data warehouses and data lakes will continue to form the lynchpin of the modern data stack, a suite of tools and technologies used to make data from disparate sources available on a single platform. The work of bringing that data together is known as data integration and is a prerequisite for analytics.

To learn more, download The Essential Guide to Data Integration.

