The Business Case for Ditching Your Data Lake

One of the most important decisions CIOs face today is where to store all of their organization's data.

For the last 10 years, many enterprises have answered this question with a two-pronged approach: First, they spin up a data lake using Hadoop or Amazon S3, which acts as a storage repository for all the raw data — both structured and unstructured — no matter the purpose. Basically, it's a place to park data before deciding what to do with it. Then, they deploy a traditional data warehouse, such as Teradata, Vertica or Netezza. Data warehouses store curated data — that is, data that is filtered and processed for a specific purpose.

Today, IT consultants and other vendors still commonly recommend a two-tier system. Before accepting such a recommendation, you should consider whether you need both a data lake and a data warehouse, or whether you can simplify your architecture by using the same physical system for both.

What's primarily driving the obsolescence of the data lake is the declining cost of storage. In a world where it costs about $30 per terabyte to store data in your data warehouse, it’s no longer necessary to split your data between an expensive data warehouse and a cheap data lake. Meanwhile, there are many advantages to combining your data lake and data warehouse into a multifunctional data "lakehouse."

One scalable solution for all of your data

First, cloud-based data warehouses offer practically unlimited storage capacity and scalability. Their serverless architecture means that compute and storage resources can be scaled up and down independently, as needed. This could not be more different from the previous generation of data warehouses, where scaling up meant delivering new hardware to your data center.

Second, modern data warehouses now combine the best features of traditional data warehouses and data lakes. They continue to offer top-notch support for structured, relational data that is used by analysts to produce reports and dashboards. But they also have first-class support for semi-structured data like JSON and XML. You can now deploy one system that supports all of your data.
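To make the semi-structured point concrete, here is a minimal sketch of querying JSON with plain SQL. It uses Python's built-in sqlite3 module as a stand-in for a cloud data warehouse and assumes the SQLite build includes the JSON functions (most recent builds do). The events table and its fields are hypothetical, and a real warehouse will have its own JSON syntax.

```python
import json
import sqlite3

# sqlite3 is only a stand-in for a cloud data warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "event_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, payload TEXT NOT NULL)"
)

# Hypothetical semi-structured events: the JSON payloads do not share one fixed shape.
events = [
    ("acme", {"type": "purchase", "amount": 120.0, "items": ["a", "b"]}),
    ("acme", {"type": "refund", "amount": 20.0}),
    ("globex", {"type": "purchase", "amount": 75.5}),
]
conn.executemany(
    "INSERT INTO events (customer, payload) VALUES (?, ?)",
    [(customer, json.dumps(payload)) for customer, payload in events],
)

# Relational filtering and aggregation applied directly to fields inside the JSON.
purchase_totals = conn.execute(
    """
    SELECT customer, SUM(json_extract(payload, '$.amount')) AS total
    FROM events
    WHERE json_extract(payload, '$.type') = 'purchase'
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

for customer, total in purchase_totals:
    print(customer, total)
```

The same table could just as easily carry ordinary typed columns next to the JSON payload, which is the sense in which one system serves both kinds of data.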

One caveat: You still need to store truly unstructured data (images, audio, and video) in object storage (in S3, for example). A best practice is to keep the unstructured files in object storage and store their metadata in data warehouse tables, including a reference to each file's location in object storage.
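A rough sketch of that pattern follows, again using Python's built-in sqlite3 module as a stand-in for the warehouse. The table name, columns, bucket name, and file paths are all hypothetical. The files themselves never enter the warehouse; only their metadata and locations do.

```python
import sqlite3

# sqlite3 stands in for the data warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE media_files (
        file_id     INTEGER PRIMARY KEY,
        object_uri  TEXT NOT NULL,   -- where the file lives in object storage
        media_type  TEXT NOT NULL,
        size_bytes  INTEGER,
        uploaded_at TEXT
    )
    """
)

# Hypothetical metadata rows; the URIs are made-up examples.
rows = [
    ("s3://example-bucket/calls/call-001.wav", "audio", 4_200_000, "2024-01-15"),
    ("s3://example-bucket/scans/invoice-17.png", "image", 310_000, "2024-01-16"),
    ("s3://example-bucket/calls/call-002.wav", "audio", 3_900_000, "2024-01-17"),
]
conn.executemany(
    "INSERT INTO media_files (object_uri, media_type, size_bytes, uploaded_at) "
    "VALUES (?, ?, ?, ?)",
    rows,
)


def find_uris(connection, media_type):
    """Return the object-storage locations of all files of one media type."""
    cursor = connection.execute(
        "SELECT object_uri FROM media_files WHERE media_type = ? ORDER BY file_id",
        (media_type,),
    )
    return [row[0] for row in cursor]


audio_uris = find_uris(conn, "audio")
print(audio_uris)
```

An analyst can then search and join on the metadata with ordinary SQL and fetch a file from object storage only when it is actually needed.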

Third, combining your data lake and data warehouse into a single system reduces data "murkiness." When you use a data lake as a staging area, it frequently turns into a dumping ground with zero data governance and terrible searchability. With everything in one system, it becomes easier to define a sequence of "stages of curation," with the right amount of governance in each stage.

Data lakes: A solution to a problem that no longer exists

Creating a data lakehouse reduces the overall complexity and latency of your systems. Because you're no longer copying data from the data lake to the data warehouse, you avoid the latency that the copy step introduces. Moving data from one stage of curation to the next can be as simple as running a SQL query.
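As an illustration of moving data between stages of curation, the sketch below promotes rows from a raw table to a curated one with a single INSERT ... SELECT statement. Python's built-in sqlite3 module stands in for the warehouse, and the table names, columns, and cleaning rules are hypothetical.

```python
import sqlite3

# sqlite3 stands in for the data warehouse (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stage 1: raw data, loaded as-is with everything stored as text.
    CREATE TABLE raw_orders (order_id TEXT, amount TEXT, country TEXT);

    -- Stage 2: curated data, typed and constrained.
    CREATE TABLE curated_orders (
        order_id   INTEGER PRIMARY KEY,
        amount_usd REAL NOT NULL,
        country    TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        ("1", "19.99", "us"),
        ("2", "", "de"),      # missing amount, filtered out during promotion
        ("3", "5.00", " FR "),
    ],
)


def promote_orders(connection):
    """Copy clean rows from the raw stage to the curated stage; return the curated row count."""
    with connection:  # commits on success, rolls back on error
        connection.execute(
            """
            INSERT INTO curated_orders (order_id, amount_usd, country)
            SELECT CAST(order_id AS INTEGER),
                   CAST(amount AS REAL),
                   UPPER(TRIM(country))
            FROM raw_orders
            WHERE amount <> ''
            """
        )
    return connection.execute("SELECT COUNT(*) FROM curated_orders").fetchone()[0]


promoted = promote_orders(conn)
curated = conn.execute(
    "SELECT order_id, amount_usd, country FROM curated_orders ORDER BY order_id"
).fetchall()
print(promoted, curated)
```

Because both stages live in the same system, nothing is exported, transferred, or reloaded along the way.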

You may be wondering: What are the tradeoffs associated with going all-in on modernizing your data warehouse?

Data lakes were a very important technology for most of the 2010s, and many vendors, including the large cloud platforms, made huge investments in products that enable data lakes. As a result, you will find a tremendous amount of content extolling the virtues of data lakes. You may encounter internal resistance from your own team, or external resistance from consultants and systems integrators when you propose a data architecture without a data lake in the center of it.

Keep in mind that these people may be grappling with the sunk cost fallacy. It's hard for them to accept that the key features of data lakes have been incorporated into modern data warehouses, and that data lakes are a solution to a problem that no longer exists.

At the end of the day, it's worth your time to run the numbers before you follow a consultant's recommendation for a data lake. In most cases, combining your data lake and data warehouse will give you a simpler system, with lower end-to-end latency, for less money.
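If you want a starting point for running the numbers, here is a deliberately simple sketch. Every input in it is a hypothetical placeholder rather than a quoted price: the data volumes, the lake price, the pipeline cost, and even the reading of the roughly $30 per terabyte figure above as a cost per billing period are all assumptions. Replace them with your own figures.

```python
def two_tier_cost(total_tb, curated_tb, lake_price_per_tb,
                  warehouse_price_per_tb, pipeline_cost):
    """Cost per billing period of a data lake plus a separate data warehouse.

    All data sits in the lake, the curated subset is stored a second time in
    the warehouse, and pipeline_cost covers copying data between the two.
    """
    return (total_tb * lake_price_per_tb
            + curated_tb * warehouse_price_per_tb
            + pipeline_cost)


def single_system_cost(total_tb, warehouse_price_per_tb):
    """Cost per billing period of keeping all data in one warehouse."""
    return total_tb * warehouse_price_per_tb


# Hypothetical inputs for illustration only; none of these are real quotes.
TOTAL_TB = 100
CURATED_TB = 20
LAKE_PRICE_PER_TB = 23
WAREHOUSE_PRICE_PER_TB = 30   # assumes the article's figure applies per billing period
PIPELINE_COST = 500           # tooling and upkeep for the copy step

two_tier = two_tier_cost(TOTAL_TB, CURATED_TB, LAKE_PRICE_PER_TB,
                         WAREHOUSE_PRICE_PER_TB, PIPELINE_COST)
single = single_system_cost(TOTAL_TB, WAREHOUSE_PRICE_PER_TB)

print(f"two-tier: {two_tier}, single system: {single}")
```

The model ignores compute and staffing costs, so treat it as a prompt for your own analysis rather than a verdict.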

A version of this blog post was originally published on Forbes Tech Council.
