Learn

A guide to columnar database

May 29, 2023

SUJETS

Traditionally, data has been stored in relational databases using a row-based storage format. However, when dealing with large volumes of data, this format can become a bottleneck, leading to slower query performance. This is where Columnar Databases come into play. By storing data in a column-based format, query performance can be greatly improved by minimizing the amount of I/O needed during query execution. This same fundamental feature has been leveraged by modern Relational Warehouses such as Amazon Redshift and Snowflake. Now let's dive into this article and learn all about columnar databases and what they bring to the table.

What is a columnar database?

A columnar database management system (DBMS) is a kind of database that stores information in columns as opposed to rows. In a traditional row-based database, each row contains a complete record with all the attributes of the record. In a columnar database, however, each column stores a particular attribute of the record, and all the values of that attribute are stored together in that column.

‍

A relational database is best suited for storing data in rows, which is commonly used for transactional applications. On the other hand, a columnar database is designed for quickly accessing data in columns, which is typically required for analytical applications. Columnar storage of tables is crucial for efficient querying because it significantly reduces disk I/O requirements and the amount of data that needs to be loaded from disk.

‍

Column-oriented databases, like other NoSQL databases, are developed to scale "out" by using clusters of inexpensive hardware to enhance throughput, making them ideal for data warehousing and processing large amounts of data.

‍

Row-based vs columnar databases

Row based databases arrange data by grouping all the information associated with a particular record together in memory. This is the conventional method of data organization and offers certain advantages for rapid data storage. Row based databases are specifically designed to facilitate the efficient reading and writing of rows.

‍

A row-based data storage system organizes data by storing and retrieving data row-by-row, with all the attributes of a particular row residing in the same physical data block. This method is optimized for retrieving complete rows of data and is commonly employed in conventional RDBMS systems.

‍

The storage of data in a row-oriented database would appear like this:

‍

ID Name Age Department
1 Steve 35 Tech
2 Jenson 28 HR
3 Rohan 32 Sales

‍

In a row based data storage system, executing a query retrieves all attributes associated with the specified row, including those that are not relevant to the query. Hence, the speed of querying can be decreased, particularly for queries that only need certain attributes within a row.

‍

Column based databases arrange data by grouping all the information associated with a specific field together in memory. They have become increasingly popular and offer significant performance benefits for querying data. Column based databases are specifically designed to facilitate efficient reading and computing on columns.

‍

A storage system for data that follows a column-oriented approach organizes and stores data by columns instead of rows. This technique is designed for efficient retrieval of specific columns of data and is commonly utilized in data analytics and warehousing systems.

‍

The organization and storage of data in a column-oriented data store would appear like this:

‍

ID Name Age Department
1 Steve 35 Tech
2 Jenson 28 HR
3 Rohan 32 Sales

‍

In a column-oriented data storage system, executing a query retrieves only the specific columns of data requested, resulting in faster query performance. Furthermore, column-oriented data stores can employ compression methods to reduce storage space and enhance performance.

SN1	Row based database	Column based database
1	When data is stored and retrieved row-by-row, there is a possibility of retrieving unnecessary data if only certain data in a row is needed	However, in a column-based data storage system, data is stored and retrieved by column, making it possible to read only the relevant data when required
2	They are best suited for online transaction systems (OLTP)	Column-based databases are best suited for online analytical processing (OLAP)
3	They are not efficient at carrying out operations that apply to the entirety of datasets. As a result, performing aggregation tasks or operations in a row-based data storage system can be quite costly	They are effective at performing operations that apply to the entirety of a dataset. This makes it possible to perform aggregation over numerous rows and columns in a column-based data storage system
4	Examples: PostgreSQL, MySQL	Examples: Redshift, BigQuery, Snowflake

‍

Limitations of row-based databases

Here are a few limitations of row-based databases:

Even though it is possible to use data compression in a database that follows a row-based approach, the compression ratio is limited because all columns are stored together in a data page and have different data types and values.
In a row-based database, each entry is saved as a row within a single data page. Therefore, it is not possible to access or handle individual columns without accessing the entire row first.
When multiple threads or nodes try to access the same data page, a row-based database may face contention problems that constrain its capacity to handle parallel processing, thus affecting its scalability.

How do columnar databases store data?

Columnar databases store data by organizing and storing data by column, rather than by row. This means that all values in a given column are stored together, rather than all values for a particular record or row. For example, all values for the "age" column would be stored together, followed by all values for the "name" column, and so on. This allows columnar databases to retrieve only the specific columns of data needed for a query, resulting in faster query performance. Furthermore, column-oriented databases have the ability to apply compression methods to decrease the storage capacity needed and enhance its efficiency.

For example,

ID Name Age
1 Steve 35
2 Jenson 28
3 Rohan 32

‍

In this case, it will be stored as : 1-2-3-Steve-Jenson-Rohan-35-28-32.

Advantages of columnar data storage

Columnar data storage offers several advantages over row-based data storage:

Faster query performance: Since columnar databases store data by column, they can retrieve only the specific columns of data needed for a query, resulting in faster query performance.
Compression efficiency: Compression techniques utilized by columnar databases are specifically designed for columnar data storage, leading to a more effective utilization of storage space and enhanced system performance.
Multipurpose: Columnar databases have become increasingly popular in big data applications and have multiple use cases, such as running OLAP cubes, storing metadata, and performing real-time analytics. This versatility is due to their ability to rapidly load new data.
Self-Indexing: One advantage of column-based DBMS is their ability to use less disk space than an RDBMS with the same data, thanks to their self-indexing feature.
Speed and Efficiency: Compared to other database approaches, columnar databases have superior speed and efficiency in executing analytical queries. They are also effective at executing joins, which can be a sluggish and inefficient method of merging data from two tables in a relational database. Conversely, a columnar database can quickly combine multiple datasets and summarize query results into a unified output.

‍

How columnar databases use compression techniques?

Compression techniques are employed by columnar databases to decrease storage demands and enhance query performance. There are several ways in which they achieve this:

‍

Dictionary encoding: This technique uses a dictionary to store unique values for a column and replaces the original values with shorter codes. This results in a reduction of the storage space necessary to store data, particularly for columns that possess a high cardinality.
Run-length encoding: This technique stores repeated values in a column as a single value and count pair. For example, if a column has 1000 consecutive values of 'A', it can be stored as 'A-1000'. This technique is especially effective for columns with long runs of repeated values.
Bit-packing: This technique stores multiple values in a single machine word, reducing the amount of space needed to store the data. For example, if a column contains boolean values, each value can be stored as a single bit, instead of a full byte.
Delta encoding: This technique stores the difference between adjacent values in a column, instead of the actual values. This is useful for columns with sequential data, such as dates or timestamps.

‍

By using these compression techniques, columnar databases can significantly reduce storage requirements and improve query performance, making them well-suited for analytical workloads.

How columnar databases deliver faster query performance?

Columnar databases offer faster query performance because of their unique storage and processing architecture. Rather than storing and handling data by rows, columnar databases store and manage data by columns. This technique facilitates more effective data retrieval and processing.

‍

In a columnar database, only the specific columns needed for a particular query are retrieved from memory, whereas a row-based database retrieves entire rows of data. Columnar databases can achieve faster query performance by retrieving only relevant data and avoiding irrelevant data. This is possible due to the way data is organized and managed by columns.

‍

Additionally, columnar databases use advanced compression techniques like run-length encoding, dictionary encoding, and bitmap encoding, which can significantly reduce storage space and enhance query performance. By compressing data by column rather than by row, columnar databases can achieve higher compression ratios and faster data retrieval.

Moreover, columnar databases are optimized for parallel processing, allowing queries to be executed simultaneously across multiple processor cores or nodes, leading to even faster query performance.

Popular columnar databases, including:

Apache Parquet

Apache Parquet is a columnar storage format designed for efficient data storage and processing of large datasets in distributed computing environments. It is an open-source file format that provides a highly optimized and compressed way of storing structured data.

Parquet is designed to be compatible with a variety of data processing frameworks, including Hadoop, Apache Spark, and Apache Arrow. It is based on the Google Dremel paper, which introduced a columnar storage format for analysis of large-scale datasets.

Parquet is designed to support both complex nested data structures and high-performance predicate pushdowns for efficient filtering of data during query execution. This feature makes columnar databases ideal for scenarios involving data warehousing, business intelligence, and data analytics.

Apache ORC

Apache ORC (Optimized Row Columnar) is an open-source columnar storage format that was developed by Hortonworks and later adopted by the Apache Software Foundation. It is designed to store structured data in a highly compressed and optimized way, making it ideal for big data processing and analytics. ORC stores data in a columnar configuration, resulting in improved and faster data processing and analysis. It offers support for multiple data types, including intricate data types such as arrays, maps, and structures. ORC is extensively utilized in big data frameworks like Apache Hadoop, Apache Spark, and Apache Hive for performing high-performance data processing and analytics.

Amazon Redshift

Amazon Redshift is a cloud-based data warehousing service by Amazon Web Services (AWS). It is designed to handle large-scale data sets and complex queries in a high-performance and cost-effective manner. Amazon Redshift is based on a columnar data storage architecture, which makes it well-suited for analytical workloads and big data applications.

Google BigQuery

Google BigQuery is an enterprise-level data warehouse that operates on a cloud-based infrastructure, allowing users to store and interrogate extensive datasets by employing a language similar to SQL. It was created by Google and forms a part of the Google Cloud Platform toolset. BigQuery is a fully managed service, which means that Google handles all aspects of infrastructure management, including scaling, availability, and security. This allows users to focus on querying and analyzing their data rather than worrying about the underlying infrastructure. BigQuery is designed to handle large-scale data processing and can be used for real-time analytics, machine learning, and data warehousing.

How FiveTran increases data transfer speed & flexibility with Columnar database?

FiveTran is a cloud-based platform for integrating data that empowers businesses to copy data from multiple sources to their cloud storage or data warehouse. In order to boost the speed and adaptability of data transfer, FiveTran employs a columnar database for storing and managing data.

‍

By using a columnar database, FiveTran can store data in a way that allows for faster data retrieval and processing. This is because columnar databases are optimized for analytical queries and can quickly retrieve specific columns of data needed for a query, rather than retrieving entire rows of data as in a row-based database. This means that FiveTran can skip over irrelevant data and retrieve only the relevant data, leading to faster data transfer and processing.

‍

In conclusion, columnar databases use compression techniques to reduce storage space and improve query performance. By compressing data by column rather than by row, FiveTran can achieve higher compression ratios and faster data retrieval, which further increases data transfer speed and reduces storage costs.

‍