The following blog post is the third chapter in a primer series by Michael Kaminsky on databases. You can read Chapter Two here if you aren’t familiar with the characteristics of transactional and analytical workloads. You can also watch this series here.
The question of whether a system stores data in rows or columns determines whether it is optimized for transactional or analytical workloads, respectively. By changing how we store the data on the hard drive of the computer, we can make certain types of queries return faster. We want the types of queries we’re running most often to be as fast as possible.
If we’re working in an analytical paradigm, we want a database optimized for aggregate functions, but if we’re working in the transactional paradigm we want a database optimized for CRUD operations.
In order to understand row and column stores, it’s important to understand how computers in general, including databases, read data from disk. Hard-disk storage is persistent, meaning that data is saved if you turn your computer off and then on again. By contrast, RAM is “working memory” which isn’t saved if you turn your computer off.
The following explanation is not 100% technically precise, but should hit the right level of abstraction for a non-technical audience. On the hard disk, at a low level, the data is organized into blocks, which is the smallest unit of information that a computer reads off of disk at one time. Any program, including a database program, will load the data that it needs by looking up the blocks that contain the data it needs, and then will go to each of those different blocks and load them into memory so that you or the program can work with them.
The database will operate faster if it reads fewer blocks. To reduce the number of blocks a database needs to read, you can either limit the overall amount of data being read, or ensure that the data is concentrated in the smallest number of blocks possible. That is the key insight we’re going to use as we’re talking about whether we store the data on disk by rows or by columns.
Let’s imagine that we have the following table with a little bit of information for a large number of users:
For a row store, when we store these data on the hard disk, we’re going to write these data one row at a time.
First we write Juan’s ID, then his name, then his state, and then his age. Then we write Elizabeth’s data, then Scott’s data, like so:
1 Juan NY 29 | 2 Elizabeth TX 22 | 3 Scott MA 35 | … | 6893849 Fernanda CA 41
Imagine that each pipe “|” represents the end of a block. Each person’s information is on the same block so that all of Juan’s data is together, all of Elizabeth’s data is together, and so on. This method of organization is extremely conducive to operations on a single object.
For a column store, we instead write the data by column, like so:
1 2 3 ... 6893849 | Juan Elizabeth Scott … Fernanda | NY TX MA … CA | 29 22 35 … 41
As with the previous example, each pipe represents a block. Instead of organizing the data by each person, we organize it by each data field, so that all ID numbers are in one block, as are all names, all states and all ages. Organizing data by column makes calculations that summarize the contents of a column far quicker.
What this Means for Querying
Let’s compare how different queries work with these different storage methods. A transactional query will read and manipulate one object at a time and might look like the following:
SELECT name, state, age
WHERE id = 3;
It should be easy to see why a transactional query will be far more efficient when data “blocks” are organized by individual objects, each represented by a row:
An analytical query, by contrast, will struggle as it will have to parse a very large number of “blocks” in order to access all of the relevant data:
An analytical query will perform an aggregate function on an entire column while ignoring most details about each record:
SELECT state, count(1) FROM data GROUP BY state;
The analytical query will be far more efficient when data “blocks” are organized by column:
A transactional query, by contrast, will be forced to read the contents of every column in order to perform basic operations on a single object:
In the next part of the series, we will take a deeper dive into transactions.
The series continues with Chapter 4 here.