The following blog post is the sixth chapter in a primer series by Michael Kaminsky on databases. You can read Chapter Five here if you missed the second lesson about transactions. You can also watch the series here.
Let’s start our discussion of distributed databases by comparing distributed and single-node databases, and discussing where you might have encountered these technologies before.
Distributed databases are made up of, and store data on, multiple computers. By contrast, single-node databases run on a single computer (we often refer to these computers as “servers,” but it’s good to remember that “server” is just another name for a computer).
Examples of distributed databases include Google Spanner and Azure Cosmos, as well as all of the “big data” warehouses like Redshift, Snowflake and BigQuery. Single-node databases include the “classic” databases like PostgreSQL, MySQL, SQLite and many others.
Reasons for Distributed Databases
Distributed databases were developed for a few key reasons:
- For some applications, we might need to store more data than can fit on any one computer.
- Additionally, analyzing huge volumes of data takes a very long time. We can speed up our queries by using the computational power of multiple computers at the same time.
- Finally, resiliency is extremely important for many applications. We can’t allow our whole system to go down if our database hardware fails or if there is a network error. If we have all of our data stored on one computer and that computer crashes, then our system will completely stop working.
The first question you might ask is “why can’t we just use bigger and better computers?” Bigger and better computers do work up to a certain point. A lot of people working with PostgreSQL or MySQL databases, for example, will upgrade the underlying hardware for that database for a while before needing to switch to a distributed database.
Big computers, however, have the following disadvantages:
- Big computers are really expensive. The bigger the computer, the more expensive, and it scales non-linearly. A computer that has 128 CPUs is going to be more expensive than 16 computers that have 8 CPUs each. Computers get disproportionately more expensive as you make them bigger and bigger.
- There’s a limit to how big anyone can make a single computer. At some point you’ll reach a level where you just simply can’t buy a computer bigger than the one you currently have.
- The more specialized your hardware is, the more likely it is to break. Specialized hardware is more fragile, has more moving parts, and is more difficult to replace, so increasing the size of our computers exposes us to more risk of a breakdown.
An important concept in distributed computing is fault tolerance. If our database is all on one computer, and that computer goes down, then our app goes down for all of our users.
If we have an application or a system that’s mission-critical or business-critical, we don’t want the whole system to be taken down if and when there are small breakages, the power goes out in our data center, or the disk fails on our computer. These are all regular, otherwise routine stoppages.
With a distributed database, our database is made up of multiple relatively smaller computers that act together as one. This adds redundancy to the system. If one computer in our database goes down, our database can continue functioning. We can deal with more data this way and also improve our fault tolerance.
You will see the following terminology a lot if you read marketing materials or documentation about distributed databases.
- Cluster – In general when we’re talking about a distributed database that’s made up of multiple different computers, we will refer to the computers collectively as a “cluster.” So if you see the words “database cluster” it normally means that there are multiple computers underlying the database.
- Node – Instead of saying the word “computer,” in general we use the term “node” to describe the individual machines that make up the distributed database cluster.
So if we have a “cluster” that’s made up of four “nodes,” what that really means is that there’s one database made up of four individual computers that are networked together.
Big Compute vs. High Availability
There are two big paradigms of distributed databases. The first is “big compute” and the second is “high availability.”
In big compute databases we “split” or “shard” the data across different nodes, and each node executes the query against a subset of the data. Then, all of the results are combined. This allows us to process our data faster by having multiple computers each do a piece of the work, rather than having one computer do all of the work.
In high-availability databases, we make redundant copies of the data on different nodes, making the system extremely fault-tolerant. If one node breaks or falls out of contact with the rest of the nodes, queries can still complete and users can still use the system. This reduces dependence on any one piece of hardware.
Here is a simplified example of splitting or sharding data evenly across three nodes in a cluster. Consider the following table: