Data warehouses, data marts and data lakes combine business data and provide users with a platform to guide business decisions. They all have distinct peculiarities and use cases. Although different individuals and companies might define each technology slightly differently, we will describe their essential attributes.
What is a data warehouse?
A data warehouse is a relational database designed for analytical queries. Analytical queries typically involve combining and summarizing values from a large number of records. For instance, an analytics team might turn a stream of events into a trendline with a count of events by day.
Data warehouses differ in design philosophy from transactional or operational databases, which perform frequent queries and updates to individual records. An example is adding, removing, and purchasing items from a cart on an ecommerce website. This basic difference in design means you must not use the two interchangeably, as they are optimized, at a very basic structural level, for fundamentally opposite kinds of operations. Traditionally, data warehouses were hosted in on-premise data centers. Increasingly, data warehouses are now based in the cloud. The most advanced cloud-based data warehouses are “serverless,” meaning that compute and storage resources can be independently scaled up and down as needed. Modern cloud data warehouses have become extremely accessible to organizations with modest resources.
Modern cloud data warehouses easily integrate with business intelligence platforms, through which analysts access business data to produce reports and dashboards. Also, data warehouses offer a fine degree of user permissions and access control, an essential feature of data governance.
What is a data mart?
Traditionally, data marts were physically separate databases. In the modern conception, a data mart is an organizational structure within a data warehouse. Data marts may contain data from a smaller range of sources and summarized, rather than raw data. The purpose of a data mart is to make analytics convenient and accessible to specific teams and business units.
The tables in traditional data marts were often of minimal size, typically <100,000 rows, and completely siloed. The modern approach is to build logical data models, i.e. views and materialized views, directly in a cloud-based, enterprise-wide data warehouse. This a simpler, more flexible approach that does not require separate machines, enables whatever levels of access the organization deems appropriate, and leverages the capabilities of a cloud-based data warehouse.
Depending on how an organization implements its technology and organizes its analytics team, the specifics of ownership and access for a data mart can vary. In some cases, teams and business units may be wholly responsible for their own data marts, and the data marts may effectively be siloed. In other cases, boundaries and access may be looser.
Like data warehouses, data marts easily integrate with business intelligence platforms.
What is a data lake?
Data warehouses and data marts are predicated on the assumption that important enterprise data is structured. Structured data follows predictable formats, is easily interpreted by a machine, and can be stored in a relational database. A data lake, by contrast, is an object or file store that can easily accommodate a large volume of raw, unstructured data such as free-form text, images, videos and other media, as well as structured data. The most basic use of a data lake is to comprehensively store huge volumes of data before deciding what to do with it. In this approach, the data lake is a staging area for a data warehouse. Another use is to train a machine learning application using a very large set of unstructured training data.
The chief disadvantage of data lakes is their “murkiness.” Data lakes can be comprehensive at the expense of easily accessible content. An exceptionally disorganized and poorly governed data lake can quickly become so murky that it becomes a data swamp. Data in data lakes can’t easily be accessed or joined using SQL or most business intelligence platforms, making it generally unsuited for use by analysts. In most cases, data warehouses are the more appropriate repository for structured business data used in analytics.
From the standpoint of data governance, data lakes do not offer a fine level of user permission and access control.
What a data lake can do that a data warehouse cannot is store large quantities of media such as documents, images, videos, and audio. These media can serve as training and validation sets for machine learning models.
Some new technologies combine characteristics of both data warehouses and data lakes. Some data lakes now incorporate characteristics of data warehouses such as ACID (atomicity, consistency, isolation, durability) transactions and schema enforcement as features to make data less “murky.” Likewise, data warehouses now sometimes support less-structured data and data science tools and languages usually associated with data lakes such as Apache Spark and Python. A data repository that combines characteristics of a data lake as well as a data warehouse may be referred to as a data lakehouse.
Data warehouses vs. data marts vs. data lakes
In short, we can compare the three different technologies like so: