Guest Post: When Your DWH Is Not a DWH

The term data warehouse (DWH) can mean different things to different people. Though the exact definition can be contentious, there are a number of strange entities claiming to be data warehouses that most assuredly are not. I am sometimes engaged by a new client for the purpose of improving a data warehouse, only to find no DWH at all, or a highly unorthodox (in a bad way) implementation of one. Here are eight red flags indicating that your organization does not have a DWH, or does not use it correctly.

1) Using a production database

If you are querying against the same machine that is running your app in production, this should raise more than a few eyebrows. Developers and analysts alike must recognize that using a production database (DB) for analytics is categorically a bad idea. Analytical queries in a production environment will compete with actual customer-facing transactions for resources, interfering with your organizations operations. In the worst-case scenario, a single careless cross-join can suck up memory and grind your production DB to a complete halt. A crashed production DB means a crashed app, a 504 error on your ecommerce website, and upset customers.

By contrast, large-scale queries are not the biggest issue if they take place in a true DWH that is separated from the production environment. There, the only consequence is slowing down the next (probably) non-critical query in the queue.

There is an important conceptual difference and divergence of function between transactional databases and analytical (columnar) databases. In a nutshell, transactional databases, also called OLTP (online transactional processing) databases, are a row-based data storage technology in which a typical query selects data from across many columns but only a single row. By contrast, columnar databases select single columns across many rows. This design lends itself to the columnar operations typical of analytics (MIN, MAX, SUM, COUNT, AVG).

Although giving analysts read access to a production table seemingly allows them to use real-time data, the fact that production DBs are poorly configured for analytic queries and are at risk for crashes and stoppages completely outweighs the benefit of this configuration.

2) Replication server

A more common (and much safer) approach is a replication or slave server, in which you spin up a copy of the production database to use as a data warehouse. Because most engineers are more familiar with OLTP databases, this approach is the first instinct of many engineers, and at least protects the application by isolating analytics burdens from the production database. Queries can be run with relative impunity due to the isolation of resources from the production database.

A replication server also provides near real-time data with limited lag and keeps data consistent without any need for an ETL process. It provides useful access to a company's core data, i.e., the organization's production DB. Indeed, a production replica can serve as a useful stopgap as a company gradually moves toward adopting a DWH.

That said, a replication server has several serious disadvantages:

At best, replication servers are an interim solution to genuine data centralization with a data warehouse and a data pipeline tool.

3) ETL and read-only access DWH

Yet another option is to set up a columnar data warehouse, but to which analysts and BI teams only have read, not write access. This is quite the opposite problem from using the production database instead of allowing analysts unlimited access to the production environment. Now, the analytics environment is wholly under the control of engineers. If they are not dedicated data engineers, they will more than likely not fully understand the needs of analysts and other end users. This approach is characteristic of traditional ETL and suffers from the characteristic rigidity and brittleness of that approach. This arrangement is also a recipe for conflict between analysts and engineers, especially over issues such as transformations, normalization, and the work required for either.

The development of columnar data warehouses is a reflection of the increasing separation between production and analytical systems. By the same token, the human elements of a company should correspond with this separation. Analysts must be able to manage the creation and iteration of tables and to produce transformations, derived tables, and temporary tables as needed. The core data operations of data cleaning and validation must take place in a centralized environment curated by dedicated individuals.

4) NoSQL databases and no SQL

An alternative to data warehousing is to maintain a non-relational, NoSQL (not only SQL) database. This approach is suitable to very specific use cases, notably when scale and comprehensiveness (as well as cost control) are more important than accessibility, and in which the users of the data are highly skilled data scientists who can use the necessary tools and technologies to handle unstructured data at scale, such as Hadoop or Spark.

Data warehouses are more suitable for general analytics and business intelligence use cases, in which the end-users rely primarily on SQL [1]. Relational modeling is still very much the norm in the analytics profession, especially at smaller scales of data. If your use case mostly concerns business intelligence and your analysts mostly use SQL, then a conventional DWH is probably more appropriate.

5) Using a system that doesn't support window functions

A supposed DWH running on MySQL has been the bane of my career thus far. Although the syntax, oddities and lack of common table expressions are all definite negatives for MySQL, the real pain is a lack of window function support (prior to v. 8.0.2 in 2018). Window functions allow an analyst to partition tables into sets of rows and perform calculations over each set. Both aggregation (MIN, MAX, AVG, SUM, COUNT) and sequencing/ranking (ROW_NUMBER, RANK, FIRST, LAST, etc.) are possible within these windows. Thus window functions allowing us to compute running averages, the magnitude of intervals between rows, and many other useful figures. Probably the most indispensable of these is the ROW_NUMBER function, as it is needed to handle the duplicates and pseudo-duplicates that will no doubt creep into your DWH.

The workarounds required to produce similar outputs to window function are difficult, error-prone and unintuitive as StackOverflow can testify.

Window functions are now standard in most SQL dialects, but many real-world databases are outdated. As a result, many professionals remain unaware of window functions. This might be forgivable for an engineer who has a well curated app layer running their transactional queries, but window functions are the bread and butter of scalable, readable SQL-based transformations. Jury-rigged workarounds might allow you to blunder through the odd case, but they are guaranteed to make your SQL unreadable. Acquaint yourself with window functions and make sure you adopt a system that supports them. At the end of this article, we provide a short list of popular servers and the versions that support window functions [2].

6) Excel and the spreadsheet mart

About 62% of organizations today rely on spreadsheets. Although few people would explicitly refer to their collection of Excel sheets as a DWH, teams using most DWH architectures will revert to spreadsheets as a crutch to some extent. Points 1, 2 and 3 (using a production DB, a replica production DB, or read-only DWH) will often push analysts and business users more and more to spreadsheets to supplement failings in their core setup. The development of a spreadsheet mart is almost universal at some point on a company's journey. They are insidious in onset and extremely hard to weed out once present. It's easy to slip from one momentarily expedient decision to another until you have a tangled mess of spreadsheets in lieu of a proper data stack. Getting rid of one that has taken root is far harder.

7) Federated queries and virtualization

Some companies use federated queries through BI tools or virtualization layers, which, like a data warehouse, allow all data sources to be accessed from a single environment, but, unlike a data warehouse, do not offer the performance benefits of actual centralization.

Building reports and dashboards over raw data is generally not sustainable, as:

Reporting over raw data doesn't solve any issues of data cleaning or governance, and is effectively a pretty-looking spreadsheet mart. Federated queries and virtualization struggle with loosely structured formats such as XML and JSON, will use up your quota of API calls, and are not particularly secure.

8) A group of data marts

A final, and interesting, possibility is to pull all of your data into a data warehouse and then split the data into separate data marts. There are valid reasons to do so when your company or organization contains many disparate teams which, for security, access and performance reasons, cannot share all of the same data.

The best practice for this setup is to fully model up your data and then split it after it has been warehoused. The data can be placed in separate schemas, or even separate machines.

Unfortunately, some clients I've met exhibit a strange tendency to partially model the data and/or prematurely pass it on it to separate divisions or business units within the company. This process then gets labeled transformation, is run on a schedule, and finds its way into reports. Since the different teams and business units within the company have separately modeled the data, the conclusions from their reports may at best only more-or-less agree, and may in fact conflict in ways that introduce contention and confusion.

What should I do?

You should be willing to break boundaries, but by intent rather than ignorance. The common pattern of using a SQL-based columnar datastore is common for a reason. It has a thriving community with well-thought-out solutions to most problems. Deviating from this structure is only to be done when you know exactly why. The majority of the points outlined above can be distilled to a few core points:

1. Transactional databases are not analytical databases.

2. An app engineer is not a data engineer.

3. A DWH won't form organically, but a mess will.

Data warehousing, properly understood, must provide a single source of truth, where everything is brought together, fully reconciled, and made sense of. If the data architecture you are working with resembles one of the problem behaviors listed above, start planning a move. Your workload and data quality will thank you for it.

Good luck!

Appendix

[1] SQL is the common language of data warehousing, BI, and analytics. The odd data scientist might provide business value by crunching a huge flat file in R or Python but more often than not that file was probably the output of someone elses SQL. Working with structured, relational tables remains an industry standard. This is borne out in how Hadoop has been enriched with numerous SQL-on-Hadoop solutions including Spark SQL, Apache Drill, Cloudera Impala and Hive to name a few. SQL as a data manipulation language reduces the barrier to entry and widens the pool of talent and learning resources for a data-team. The quasi-exception to needing SQL as your DWH language is, of course, the aforementioned Hadoop solutions that also offer Python, Java and Scala APIs. The argument that Hadoop remains useful in the face of modern columnar data stores (Redshift, Bigquery*) and managed Data-lake extensions (Redshift Spectrum, Athena, Snowflake*) is not one I agree with but certainly one that exists.

Our discussion is more geared towards the use of transactional NoSQL databases to run analytical queries - primarily MongoDB. Although NoSQL and document-based solutions (eg. MongoDB, DynamoDB, Google BigTable etc.) are strong competitors to SQL in the world of production, they have no place in BI any more than their relational counterparts. NoSQL DBs often have core tenants that sound familiar to BI - denormalisation, read performance, and horizontal compute scalability, but this is where the likeness stops. Document-based DBs are even more tightly tuned for transactional behaviour than SQL counterparts. Although MongoDB boasts APIs for mapping, reducing and aggregation these are for a corner-case of production rather than analytics. In the face of a production environment that runs on a NoSQL datastore a quick duplicate to serve as a DWH might seem wise, but in truth you are best served to extract everything (`table.find({})` or SELECT * in SQL syntax) directly into a nice BigQuery** table and start SQLing from there.

*BigQuery and Snowflake have an implicit and explicit separation of storage and compute that leaves them in between a columnar datastore and data-lake as per the discussion above.

**Fun fact: Fivetran CEO George Fraser once hypothesized to me that part of Big Query's initial slowness in the face of Redshift was due in no small part to Big Query's odd implementation of SQL syntax

[2] Window Function Versions

Amazon Redshift – All versions
Google BigQuery – All versions
Snowflake – All versions
Oracle – v8i (1998)
PostgreSQL – v8.4 (2009)
Microsoft SQL Server – SQL Server 2012SQLite – v3.25 (2018)
MySQL – v8.0.2 (2018)
MariaDB – v10.2 (2017)
Apache Spark – All versions
Teradata – All versions
Cloudera Impala – All versions