Star schema vs. OBT: An analysis of which is better for your data warehouse
Data warehouse modeling is a crucial but overlooked part of the development of a data warehouse. Data warehouse modeling is the process of creating schemas for the summarized and detailed information in a data warehouse.
Your data warehouse is essentially a repository for data collected from several sources which have related or similar data in various formats. Data warehouse modeling maps out these varying structures and formats to figure out how to ensure that incoming data adheres to your data warehouse’s design. This helps to make your data productive for data mining and analysis.
Think of your data warehouse model as an intermediary language that enables your systems to coordinate their acceptance and understanding of the data mentioned in the model. The choice of your schema can affect data warehouse performance. Many have wondered, and some have even claimed to know, whether star schema or one big table (OBT) provides the best performance in a data warehouse. In this article, we’ve run some tests to help you reach the right decision. But, before that, let’s learn about the data warehouse schemas you’ll be using.
What is a star schema?
A star schema uses a central table, known as a fact table, to store measured or transactional data. A fact table uses aggregated facts to store key business information, such as financial data or logged performance data. Fact tables are surrounded by multiple dimensional tables that store attributes related to data in the fact table.
Star schema is generally considered ideal for reporting as it makes data retrieval easier. It’s also quite convenient to use joins in queries in star schema, which can boost query performance.
What is OBT?
OBT stands for one big table. As the name suggests, it refers to using a single table for housing all data in a single large table. This approach ensures the warehouse doesn't have to perform any joins on the fly. Due to its simplicity, OBT is good for small teams and small projects that are focused on tracking a specific item. This item is usually the one that has several attributes associated with it.
For example, if you’re looking to use your data warehouse for customer analysis, your OBT will be focused on “customers” with attributes, such as customer id, name, age, etc.
Star schema vs. OBT: An analysis of which is better for your data warehouse
The objective of this analysis is to understand the performance implications of these different warehouse modeling patterns under normal BI-style workloads within a given warehouse. That is, we aren't trying to benchmark warehouses against each other or understand their relative performance and cost tradeoffs. We want to understand how different data warehouse modeling patterns perform once you've chosen which warehouse to use.
In particular, this analysis is focused on architecture patterns to support BI style workloads, not necessarily the query performance of miscellaneous, arbitrarily complex ad-hoc queries. The way many people build their warehouses today (using an ELT paradigm with a tool like dbt), the star schema is constructed at the end of an ELT run and is explicitly designed to support BI-type queries in tools like Looker or Periscope. With that in mind, the queries we use to test these different distribution styles are not especially complex, as they're intentionally designed to reflect the common sorts of queries that are run by a BI tool — aggregating measures over a variety of different dimensions, occasionally with a CTE or window function thrown in.
You can review the queries we used for the test here.
The results: Denormalized tables result in faster query response
For all three of the warehouses we tested — Redshift, Snowflake, and BigQuery — using a single denormalized table instead of a star schema leads to a substantial improvement in query times. The speed improvement of using a single denormalized table represents an improvement of 25%-50%, depending on which warehouse you're using. This amounts to a difference of about 10 seconds on a single-node cluster in Redshift. Excluding Redshift query compilation time, the improvements are:
- Redshift: 25%-30% (depending on warehouse size and number of clusters)
- Snowflake: ~25%
- BigQuery: ~50%
For the redshift results, we present data from runs using a large multi-node cluster as well as a small single-node cluster. We also split the results between the first time a query was executed (which will include the time Redshift needs to compile the query) as well as subsequent runs that only include compute time.
Here we can see that the OBT (denormalized) model outperforms the star-schema model in all but one of the ten queries we tested. With the exception of the query-4 enigma, the denormalized table outperforms the star schema from 10% to 45%, depending on the query.
For Snowflake, the results are more mixed. While the OBT (denormalized) model is definitely faster than the star schema in the slowest of queries (queries 8, 9, and 10), the star schema actually does appear to out-perform the OBT model in some of the simpler queries (namely 3, 4, and 7). Note that these queries include query compilation time.
We don’t have a good enough intuition for the inner workings of Snowflake to cast any light on why this might be happening, but if any of our readers are Snowflake experts, we'd love to hear your hypotheses!
For BigQuery, the results are even more dramatic than what we saw in Redshift — the average improvement in query response time is 49%, with the denormalized table outperforming the star schema in every category. Note that these queries include query compilation time.
One thing that's interesting to note is how dramatically different the variances in the query response times are between the two different distribution styles — the star schema has a much higher variance in query response time, which we assume has to do with how BigQuery is planning the execution under the hood (but we’re definitely not a BQ expert, so we would love someone with more knowledge to weigh-in on what's going on here).
This comparison was made using a subset of the data from the TPC-DS benchmark, kindly made available by the folks at Fivetran. For all analyses, we used the TPC-DS "100" data.
- dc2.large with 1 node
- dc2.8xlarge cluster with three nodes
- X-Large warehouse (16 servers)
- I used whatever the default configuration comes with a fresh warehouse
We make use of the following tables: store_sales, date_dim, store, household_demographics, customer_address.
For the star schema, we kept the tables as-is (distributing the fact table by ss_item_key and distributing the dimension tables across all nodes. In Redshift, we distribute this by ss_item_key as well).
For Snowflake, we exclude the initial run for the queries that serve the purpose of "warming the cache" by reading the data from S3 onto the SSD that Snowflake uses for caching. While the results of the query aren't being cached directly, we do want to assess Snowflake under the circumstances where the data has been read off of S3. You can read more about Snowflake caching here.
For the denormalized tables, we do a simple join to bring everything together.
All of the code to reproduce the analysis can be found in this repo.
There are a few reasons you might want to consider using the star schema (or something like it):
- The star schema promotes better ELT / ETL code conceptualization and organization.
- The star schema is easier for end users (analysts and other query writers) to navigate.
- The star schema takes up less disk space.
While the first two concerns are important, we think they can be handled pretty easily by staging your ELT process such that the data all get transformed into something like a star schema before everything gets re-joined back together for end-user querying.
The third point deserves more consideration, especially in a data warehouse like Redshift — materializing the denormalized takes up a significant amount of disk space on the cluster. Simply materializing the table bumped the disk-space usage up from a bit over 30 gigabytes to over 90.
This is only a subset of the data we could have joined to store_sales! In fact, when we initially started on this analysis task, we wanted to join all of the possible dimensions onto store_sales but couldn't because Redshift ran out of disk space (on a dc2.large cluster with one node).
Depending on the scale of your data, the storage cost of duplicating all of the dimensions on disk could be too high.
Get in touch
If you have questions or thoughts on this analysis, I would love to hear you. You can reach me via email at firstname.lastname@example.org, or you can find me at my blog locallyoptimistic.com.
Determining why the star-schema out performs the denormalized table on query 4 in the single-node cluster (but not the multi-node cluster) is left as an exercise for the reader. Mostly because I have no idea.
Because dbt doesn't have the ability to specify column compression or encoding style in Redshift, this is probably the worst-possible-case in terms of disk storage size. I suspect that with proper column encoding you could alleviate a fair amount of this issue.