How to choose between a columnar database vs. row database
Database performance relies heavily on the type of database storage you choose. Most organizations use either a row database, a column database or a combination of both.
Row-oriented databases are ideal for some access patterns, specifically transactional processing, while column-oriented databases are better suited for analytical processing.
The database type you choose — row vs. columnar database — is vital. The decision needs to be carefully made after weighing the advantages and tradeoffs of both types and analyzing your usage needs.
In this article, we’ll look at the difference between a columnar and row database and how they work. We’ll also explain their pros and cons and why columnar databases are superior for analytics teams.
Row database
Row-oriented databases are used in traditional database management systems that are primarily focused on record storage and retrieval. In a row store or row database, the data is stored row by row. This means that data from the first column of a row is placed right after the data from the last column of the previous row. Let’s simplify this with an example. Say you want to store the data from a table called “Top Customers”:

Name    Gender    Country      Age
John    Male      USA          63
Mary    Female    Canada       29
James   Male      Australia    48
In a row database, this would be stored as:
John Male USA 63 | Mary Female Canada 29 | James Male Australia 48 |
Here the “|” signifies the end of the block, meaning all of the data within that block is stored together.
This disk storage method is ideal for transactional queries, because all of a record’s fields sit together in the same block and can be read or written in one pass.
So, row databases are commonly used for Online Transactional Processing (OLTP), where a single “transaction,” such as inserting, removing or updating, can be performed quickly using small amounts of data.
Row stores, like MySQL and Postgres, are great when all or most of the data in a row needs to be used. Data engineers use indexing to avoid full-table scans and optimize queries: an index is a lookup key built from one or more columns chosen to match your access patterns.
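As a rough illustration, here’s a minimal sketch using Python’s built-in sqlite3 module (SQLite stores its data row by row); the table, column and index names are hypothetical and simply mirror the “Top Customers” example.

```python
import sqlite3

# SQLite keeps each row's fields together on disk, making it a handy
# stand-in for a row store.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE top_customers (name TEXT, gender TEXT, country TEXT, age INTEGER)"
)
con.executemany(
    "INSERT INTO top_customers VALUES (?, ?, ?, ?)",
    [
        ("John", "Male", "USA", 63),
        ("Mary", "Female", "Canada", 29),
        ("James", "Male", "Australia", 48),
    ],
)

# Index the columns your queries filter on so lookups can avoid a
# full-table scan.
con.execute("CREATE INDEX idx_customers_country ON top_customers (country, name)")

# This point lookup can now use the index to find matching rows instead
# of scanning the whole table.
print(con.execute(
    "SELECT name, age FROM top_customers WHERE country = 'USA'"
).fetchall())  # [('John', 63)]
```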
Pros
Modifying data is easier
Since data within a row database is stored in individual blocks, adding new data and modifying current values is easier. New data is just appended to the end of the previous block. For example, suppose we add new values to the “Top Customers” data.
It would then look like this:
John Male USA 63 | Mary Female Canada 29 | James Male Australia 48 | Kate Female UK 38 |
Deleting a row is also easier in this storage since the entire data block can be wiped at once.
Ideal for OLTP
Row-based databases are very good at processing single-row operations. They are built for fast and efficient online transaction processing in highly concurrent environments and are often heavily indexed.
Cons
Slower data aggregation
Row-oriented databases struggle with aggregation since data from every row has to be loaded before relevant data can be extracted and acted upon.
For example, if we wanted the average age of male customers in the “Top Customers” table, a row store would have to read every full row, including columns the query doesn’t need, before it could pull out the relevant values and compute the average.
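To make that cost concrete, here’s a small conceptual sketch in Python; this isn’t a real database, just the row layout represented as a plain data structure.

```python
# Conceptual row layout: each record is stored (and read) as a whole.
rows = [
    ("John", "Male", "USA", 63),
    ("Mary", "Female", "Canada", 29),
    ("James", "Male", "Australia", 48),
]

# To average the age of male customers, every full row passes through the
# loop, even though only the gender and age fields are actually needed.
ages = [age for (name, gender, country, age) in rows if gender == "Male"]
print(sum(ages) / len(ages))  # (63 + 48) / 2 = 55.5
```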
Less efficient compression
Compression mechanisms are less efficient on row databases because each block mixes values of different data types from different columns, leaving little repetition for the compressor to exploit. Poor compression makes handling large volumes of data more expensive.
Requires additional space
Since row databases cannot be compressed as effectively, they require significantly more storage space than columnar databases. On top of the data itself, row databases take up extra space for the multiple indexes they typically rely on.
Columnar database
A columnar database stores all the data from each column as a single block. Think of it as vertical partitioning compared to the horizontal partitioning of a row store.
For example, a columnar database would store the ‘Top Customers’ data in the following manner:
John Mary James | Male Female Male | USA Canada Australia | 63 29 48 |
Organizing data this way makes it easier to aggregate data and perform calculations. If we wanted to find the average age of this group, we could directly calculate this using the data from that specific block. This makes column-oriented stores great for online analytical processing (OLAP).
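Here’s the same conceptual Python sketch with a columnar layout (again, plain data structures rather than a real engine): only the two relevant column blocks are touched.

```python
# Conceptual columnar layout: each column is stored as its own block.
columns = {
    "name":    ["John", "Mary", "James"],
    "gender":  ["Male", "Female", "Male"],
    "country": ["USA", "Canada", "Australia"],
    "age":     [63, 29, 48],
}

# Only the gender and age blocks are read; name and country are never touched.
ages = [age for gender, age in zip(columns["gender"], columns["age"]) if gender == "Male"]
print(sum(ages) / len(ages))  # 55.5
```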
Pros
Best for OLAP applications
Databases using columnar storage have a major benefit over equivalent databases using row-based storage: query performance is much faster for analytical queries that crunch through lots of data, because only the columns needed to compute the query result are read from storage.
Most modern analytical and reporting relational databases, such as Amazon Redshift, SAP HANA, Oracle (Database In-Memory), Microsoft (SQL Server columnstore indexes) and Actian’s Vector database, use columnar storage. If you’re looking for a high-performance solution to support reporting through SQL-based tools, consider a columnar database.
Faster data aggregation
Data aggregation is faster in columnar databases since each block contains all the values from a particular data field. This enables analytical applications to aggregate a high volume of data from multiple columns without searching the entire table. Rapid aggregation is a must for analytical workloads that typically involve complex queries.
High compression speeds
In a column-based database, every value within a block shares the same data type, so compression algorithms are far more efficient. Each column can be compressed as its own contiguous block, allowing analysts and data engineers to compress high data volumes quickly.
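For intuition, here’s a simple run-length encoding sketch in Python; real columnar engines use more sophisticated schemes (dictionary, delta and bit-packing encodings, for example), so treat this purely as an illustration of why uniform, repetitive columns compress so well.

```python
from itertools import groupby

# A single column holds values of one type, and sorted or low-cardinality
# columns are full of repeats, so even naive run-length encoding shrinks them.
country_column = ["Canada", "Canada", "Canada", "USA", "USA", "USA", "USA", "UK"]
encoded = [(value, len(list(group))) for value, group in groupby(country_column)]
print(encoded)  # [('Canada', 3), ('USA', 4), ('UK', 1)]

# A row-store block mixes types and values from different columns,
# leaving little repetition like this for the compressor to exploit.
row_block = ["John", "Male", "USA", 63, "Mary", "Female", "Canada", 29]
```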
Requires less space
Better compression means data will occupy less disk space. Organizations can store, aggregate and operate on many data sets without investing in additional storage options.
Cons
Data modification is slower
Single-row operations on columnar databases are generally less efficient and modification is slower than in row-based databases. Columnar databases prefer to process inserts, updates and deletes (or merges) as batch operations.
For example, if you wanted to add a row of data, each value from the new row has to be added to the correct block of the existing database. This can become complicated when there are many columns.
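Going back to the conceptual Python layouts from earlier, a single-row insert on a columnar layout has to touch every column block, while a row layout just appends one record:

```python
# Row layout: appending a new record touches a single place.
rows = [("John", "Male", "USA", 63)]
rows.append(("Kate", "Female", "UK", 38))

# Columnar layout: the same insert must update every column block.
columns = {"name": ["John"], "gender": ["Male"], "country": ["USA"], "age": [63]}
new_row = {"name": "Kate", "gender": "Female", "country": "UK", "age": 38}
for column_name, value in new_row.items():
    columns[column_name].append(value)

print(columns["name"], columns["age"])  # ['John', 'Kate'] [63, 38]
```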
Here’s a summary of how a row database vs. columnar database stacks up:

                       Row database                         Columnar database
Best for               OLTP (transactional workloads)       OLAP (analytical workloads)
Writes and updates     Fast single-row operations           Slower; best applied in batches
Aggregation            Slower; whole rows must be read      Faster; only the needed columns are read
Compression            Less effective                       Highly effective
Storage footprint      Larger                               Smaller
Examples               MySQL, Postgres                      Amazon Redshift, SAP HANA, Actian Vector
Why columnar databases are superior for analytics
From an analytical perspective, columnar databases easily outperform row stores. There are three reasons why.
Compression speed
As we discussed earlier, compression speeds are much faster on a columnar database compared to a row database.
This is crucial for analytical queries since it enables analysts and business intelligence tools to parse through high data volumes and gain insights rapidly.
Easier access
Typically, analytical applications require data from just a few columns at a time, while organizations usually have tables consisting of hundreds or thousands of columns and far more rows. A query on a row database would require loading all of this data and then parsing through the relevant sections to find the answer. Often, the underlying data changes faster than a row database can answer the query. A columnar database reads only the columns involved, making it easier to access rapidly changing data and enabling applications to answer queries quickly.
Better storage
Compression mechanisms are more efficient in columnar databases. This enables highly efficient storage. A column database can store significantly more data than a row database when given the same disk space.
Columnar databases also allow data to be sorted in different ways. Sort-ordered columns or projections store the same data multiple times in different sort orders, which makes column databases more fault tolerant.
Cost vs. performance
End users always want the best performance and up-to-date data, while management typically wants the lowest cost. Finding a balance between these requirements isn’t easy.
If cost weren’t an issue, the solution might be to deploy the biggest OLTP system available (e.g., Oracle Exadata) and pay for columnar optimizations on top of it, achieving fast transaction processing and high-performance queries on a single database. More realistically, your organization may be considering alternatives like Amazon Redshift to balance performance and cost.
Heterogeneous replication
Keeping a second database in sync can be achieved through real-time data replication. Most data replication solutions perform efficient log-based change data capture (CDC) on the source OLTP databases and replay each operation as SQL statements on the destination system.
For a columnar destination, you can add performance optimizations, such as batching changes and applying them as merges, to keep the system up to date in (near) real time despite high transaction volumes on the source database(s).
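Here’s a hedged sketch of the idea in Python; the change-event format and table names are invented for illustration, and SQLite stands in for the destination (a real pipeline would read the source’s transaction log and apply batched changes to a columnar warehouse).

```python
import sqlite3

# Hypothetical change events, as a log-based CDC reader might emit them.
changes = [
    {"op": "insert", "id": 4, "name": "Kate", "age": 38},
    {"op": "update", "id": 2, "name": "Mary", "age": 30},
    {"op": "delete", "id": 3},
]

# Destination stand-in; a real pipeline would target a columnar warehouse.
dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
dest.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, "John", 63), (2, "Mary", 29), (3, "James", 48)])

# Replay each logged operation as a SQL statement on the destination.
# Columnar destinations usually accumulate these and apply them as batched merges.
for change in changes:
    if change["op"] == "insert":
        dest.execute("INSERT INTO customers VALUES (?, ?, ?)",
                     (change["id"], change["name"], change["age"]))
    elif change["op"] == "update":
        dest.execute("UPDATE customers SET name = ?, age = ? WHERE id = ?",
                     (change["name"], change["age"], change["id"]))
    elif change["op"] == "delete":
        dest.execute("DELETE FROM customers WHERE id = ?", (change["id"],))
dest.commit()

print(dest.execute("SELECT * FROM customers ORDER BY id").fetchall())
# [(1, 'John', 63), (2, 'Mary', 30), (4, 'Kate', 38)]
```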
ELT for columnar databases
The ELT (Extract, Load, Transform) process collects data from sources and loads it, largely as-is, into destination storage. Analysts and engineers then apply transformations inside the destination to produce analysis-ready data.
An ELT workflow is ideal for data pipelines that use columnar databases, like a cloud data warehouse. It facilitates rapid data collection and loading into a warehouse or a data lakehouse. ELT allows for faster data access, enables automated data pipelines, costs less than an ETL-based pipeline and leverages the native capabilities of cloud data warehouses.
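As a minimal sketch of the pattern (Python with SQLite standing in for the destination; a real ELT pipeline would load into a columnar warehouse such as Amazon Redshift, and the table names here are hypothetical), raw data is landed first and then transformed with SQL inside the destination:

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")

# Extract + Load: land the raw source data as-is, with no transformation yet.
warehouse.execute(
    "CREATE TABLE raw_customers (name TEXT, gender TEXT, country TEXT, age INTEGER)"
)
warehouse.executemany(
    "INSERT INTO raw_customers VALUES (?, ?, ?, ?)",
    [("John", "Male", "USA", 63), ("Mary", "Female", "Canada", 29), ("James", "Male", "Australia", 48)],
)

# Transform: build analysis-ready tables with SQL inside the destination,
# where a columnar engine can aggregate efficiently.
warehouse.execute("""
    CREATE TABLE avg_age_by_gender AS
    SELECT gender, AVG(age) AS avg_age
    FROM raw_customers
    GROUP BY gender
""")

print(warehouse.execute("SELECT * FROM avg_age_by_gender ORDER BY gender").fetchall())
# [('Female', 29.0), ('Male', 55.5)]
```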
Platforms that allow organizations to create automated ELT pipelines seamlessly, like Fivetran, are future-proof. They use connectors to easily integrate columnar databases, like Amazon Redshift, into your pipeline. ELT pipelines are the best choice for most modern organizations that rely on quick data collection from multiple sources.
Conclusion
Row and column-based databases serve unique purposes. However, current business needs call for speed. Inefficient storage can lead to slower analysis, hindering insights, innovation and growth.
Column-based databases are the go-to storage option for OLAP applications and facilitate faster query resolution. Businesses often use row and column stores together to handle different query types. Your data pipeline must accommodate both storage options, and an ELT pipeline is best suited for this. It facilitates fast data transfer and custom transformations, both of which can enhance analytics.