In a survey conducted by IT consulting firm Capgemini, 77 percent of enterprises said that decision-making in their organizations was completely data-driven. The same survey showed that only 39 percent of respondents were able to turn their data-driven insights into a sustained competitive advantage. That’s because the data-driven approach only works when you have access to the right data at the right time, something 69 percent of respondents said they find challenging.
Organizations use data lakes and data warehouses to store large amounts of data. They use these tools in combination with business intelligence and analytics tools to gain insights and make decisions. Used correctly, a data warehouse or data lake can support faster, timelier and more accurate decision-making. Let’s take a detailed look at both technologies.
What is a data warehouse?
A data warehouse is a large, central repository of structured data. It receives business data from a variety of sources in a predefined format and readies it for reporting and analysis.
How does a data warehouse work?
A data warehouse follows the conventional extract, transform, load (ETL) process, where data is extracted from the source, transformed into a standardized format and loaded into the warehouse. It usually receives data from:
- Operational databases – Examples include Microsoft SQL Server, Amazon DynamoDB, Apache Cassandra and MongoDB.
- Transactional systems – Examples include airline reservation systems or bank account processing systems.
- All software applications that produce structured data (data in a standardized format). Examples include marketing automation tools, CRMs and data tracking and analytics tools.
Data warehouses use a schema-on-write model: data is cleaned and prepped to fit a predefined schema before it is stored. That is why data stored in data warehouses is curated, centralized and ready for use.
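The schema-on-write flow can be sketched in a few lines of Python. This is a minimal illustration, not a real pipeline: the source records are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Hypothetical raw records from an operational source; field names and
# types often vary between sources.
raw_orders = [
    {"id": "1001", "amount": "19.99", "placed": "2023-04-01"},
    {"id": "1002", "amount": "5.50", "placed": "2023-04-02"},
]

def transform(record):
    # Schema-on-write: coerce every record into the warehouse's
    # predefined schema *before* it is stored.
    return (int(record["id"]), float(record["amount"]), record["placed"])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, placed TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [transform(r) for r in raw_orders],
)
conn.commit()

# Because the data was structured on the way in, it is immediately
# ready for analytical queries.
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(round(total, 2))
```

A record that fails the `transform` step would be rejected or routed to error handling before ever reaching storage, which is exactly what keeps warehouse data curated.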
Why use a data warehouse?
Since data warehouses store data that’s ready for analysis, both technical and non-technical users can benefit from it. Data from a data warehouse is used in business intelligence, data visualization and batch reporting. Analysis tools can connect directly to the data warehouse, or the warehouse itself can offer built-in reporting features. In either case, reports and dashboards, such as monthly sales reports, are usually automated to be delivered periodically, and business users can access them as needed.
How is a data warehouse different from transactional or operational databases?
Data warehouses differ in design philosophy from transactional or operational databases, which perform frequent queries and updates against individual records. An example is adding, removing and purchasing items from a cart on an ecommerce website. By contrast, data warehouses are designed to perform calculations across large groups of records at once. Because the two are optimized at a structural level for fundamentally different kinds of operations, they should not be used interchangeably.
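The difference in workload shows up clearly in the queries each system handles. The sketch below uses a hypothetical cart table in SQLite to contrast the two patterns: a transactional system touches one record at a time, while an analytical query scans and aggregates many records at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cart_items (cart_id INTEGER, sku TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO cart_items VALUES (?, ?, ?)",
    [(1, "A", 2), (1, "B", 1), (2, "A", 5)],
)

# Transactional workload: frequent reads and writes of individual
# records, e.g. updating one line item in one shopper's cart.
conn.execute("UPDATE cart_items SET qty = 3 WHERE cart_id = 1 AND sku = 'A'")

# Analytical workload: a single query that aggregates across many
# records at once, e.g. total quantity per SKU across all carts.
totals = conn.execute(
    "SELECT sku, SUM(qty) FROM cart_items GROUP BY sku ORDER BY sku"
).fetchall()
print(totals)  # [('A', 8), ('B', 1)]
```

An operational database is tuned so the `UPDATE` is fast under heavy concurrency; a warehouse is tuned so the `GROUP BY` scan stays fast even across billions of rows.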
Traditionally, data warehouses were hosted in on-premises data centers. Increasingly, data warehouses are now based in the cloud. The most advanced cloud-based data warehouses are “serverless,” meaning the provider manages the underlying infrastructure and compute and storage resources can be scaled up and down independently as needed. Modern cloud data warehouses have become extremely accessible to organizations with modest resources.
Modern cloud data warehouses easily integrate with business intelligence platforms, through which analysts access business data to produce reports and dashboards. Data warehouses also offer fine-grained user permissions and access control, an essential feature of data governance.
What is a data lake?
A data lake is a large repository of structured and unstructured data. It can store data from any source and in any format so that the data can be used for data discovery, advanced business intelligence and machine learning.
How does a data lake work?
A data lake uses the extract, load, transform (ELT) method in its data pipeline — i.e., data is extracted from the source(s), loaded into the lake and transformed as needed. It uses a schema-on-read approach, in which data is given structure only when it is pulled for analysis. Unlike a data warehouse, which requires the source to deliver structured data (or a tool to give the data structure before it is stored), a data lake works with all kinds of sources.
Data lakes store all types of documents and raw data such as user data, research data, video and media files, application data, medical imaging data and so on — the sky’s the limit. The data stored in a data lake is not ready for use — it must be standardized, cleaned and prepped before it can be used for analysis.
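Schema-on-read can be sketched the same way. In this hypothetical example, raw JSON event payloads are landed in the lake exactly as received, and structure is imposed only at read time; an in-memory file object stands in for a lake object.

```python
import io
import json

# Hypothetical raw event payloads, stored exactly as received; no
# schema is enforced on the way in, so inconsistent records land too.
lake_object = io.StringIO(
    '{"user": "ana", "event": "click", "ts": "2023-04-01T09:00:00"}\n'
    '{"user": "bo", "event": "view"}\n'  # missing "ts": still stored
)

def read_with_schema(fileobj):
    # Schema-on-read: structure is imposed only when the data is
    # pulled for analysis; missing fields are handled at query time.
    for line in fileobj:
        rec = json.loads(line)
        yield (rec["user"], rec["event"], rec.get("ts"))

rows = list(read_with_schema(lake_object))
print(rows)
```

The flexibility cuts both ways: nothing is rejected at write time, so every reader must cope with missing or inconsistent fields — the "murkiness" discussed below.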
Why use data lakes?
Data scientists and data analysts use data lakes for:
- Big data analytics – to find hidden patterns and correlations in huge collections of data, such as changing customer preferences leading to new market trends
- Machine learning – to train machine learning algorithms by feeding them large volumes of data
- Predictive analytics – to use historical data to predict future events
The chief disadvantage of data lakes is their “murkiness” or lack of structure. Data lakes can be comprehensive at the expense of easily accessible content. An exceptionally disorganized and poorly governed data lake can quickly become so murky that it becomes a data swamp.
Data in a data lake can’t be easily accessed or joined using SQL or most business intelligence platforms, making data lakes generally unsuitable for analysts. In most cases, data warehouses are the more appropriate repository for structured business data used in analytics.
From the standpoint of data governance, data lakes often do not offer fine-grained user permissions and access control.
How data lakes and data warehouses complement each other
Data warehouses are predicated on the assumption that important enterprise data is structured. Structured data follows predictable formats, is easily interpreted by a machine and can be stored in a relational database. The most basic use of a data lake is to comprehensively store huge volumes of data before deciding what to do with it. In this approach, the data lake is a staging area for a data warehouse.
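The lake-as-staging-area pattern is simply the two earlier sketches joined together: a batch job reads raw objects from the lake, gives them structure and loads the result into the warehouse. The data and names below are hypothetical, with SQLite again standing in for the warehouse.

```python
import json
import sqlite3

# Hypothetical raw JSON objects sitting in the lake (staging area).
lake_files = [
    '{"user": "ana", "amount": "12.00"}',
    '{"user": "bo", "amount": "3.25"}',
]

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE purchases (user TEXT, amount REAL)")

for raw in lake_files:
    rec = json.loads(raw)                      # read raw object from the lake
    row = (rec["user"], float(rec["amount"]))  # give it warehouse structure
    warehouse.execute("INSERT INTO purchases VALUES (?, ?)", row)
warehouse.commit()

count = warehouse.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]
print(count)  # 2
```

In this arrangement the lake preserves everything cheaply and indefinitely, while the warehouse holds only the curated subset that analysts actually query.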
Some newer technologies combine characteristics of both data warehouses and data lakes. Some data lakes now incorporate data warehouse features such as ACID (atomicity, consistency, isolation, durability) transactions and schema enforcement to make data less “murky.”
Likewise, data warehouses now sometimes support less structured data and data-science tools and languages usually associated with data lakes, such as Apache Spark and Python. A data repository that combines characteristics of a data lake and a data warehouse may be referred to as a data lakehouse.
Data warehouses vs. data lakes
In short, we can compare the two technologies like so:

| | Data warehouse | Data lake |
|---|---|---|
| Data | Structured | Structured and unstructured |
| Pipeline | ETL (extract, transform, load) | ELT (extract, load, transform) |
| Schema | Schema-on-write | Schema-on-read |
| Readiness | Curated and ready for analysis | Raw; must be prepped before use |
| Typical users | Analysts and business users | Data scientists and data engineers |
| Typical uses | Business intelligence, dashboards, batch reporting | Big data analytics, machine learning, predictive analytics |
When to use data warehouses and data lakes
Data warehouses and data lakes form the linchpin of the modern data stack, a suite of tools and technologies used to make data from disparate sources available on a single platform. This work is collectively known as data integration and is a prerequisite for analytics.
The simplest approach to data integration, and the one Fivetran recommends to most organizations, is to use a data warehouse as a data repository. The data stack consists of the following components:
- Data pipeline
- Data warehouse
- Business intelligence tool
Business intelligence (BI) tools typically support access to data warehouses, not data lakes. A BI reporting and dashboard tool that feeds from a data lake is likely to be a custom solution built by a data scientist or data engineer with all of the associated expenses.
Data lakes can also serve as the central repository, storing both structured and unstructured data while (usually) sacrificing a relational structure. This stack is possible but not usually advisable: data from the lake still needs to be given structure before a BI tool can use it, either by loading it into a warehouse (an extra hop that slows down delivery) or by building a custom solution that reads directly from the lake. If you choose to go with this stack, this is how it’ll look:
- Data pipeline
- Data lake
- Data warehouse
- Business intelligence tool
Your exact configuration will depend on your exact use case, the size and composition of your company and the skill sets of your analysts and engineers.
If all else fails, though, just start with a data warehouse.
Use data connectors to populate destinations
Data warehouses and lakes are both capable of storing large volumes of data from a wide range of sources. The challenge is data integration – moving data from applications, event streams, databases, files and more into these destinations. Despite the seeming simplicity of moving data from one location to another, ensuring that the data is faithfully represented and flows in a timely manner is an architecturally complicated problem.
This problem only grows in scale and complexity as an organization adds more data sources. Building and maintaining a bespoke, in-house solution for data integration is incredibly costly in terms of time, labor and talent.
The answer is to leverage the capabilities of the modern cloud by automating and outsourcing data integration to a provider that specializes in the real-time movement of high volumes of data. Try Fivetran and experience for yourself how simple and painless data integration can be.