If you’ve browsed Netflix’s catalog or searched for items on Amazon’s website sometime in the last several years, you’ve likely seen the computing power of a distributed database firsthand.
Technology giants like Microsoft, Apple and PayPal are often the first companies we think of as the primary users of distributed databases.
But what exactly are distributed databases, and what are their use cases?
In this article, we’ll dig deep into everything you need to know about using a distributed database for your organization.
What is a distributed database?
A distributed database is any database where records are held in more than one location. It’s a web of many different databases connected through one centrally accessible network.
A distributed database management system (DDBMS) centrally manages all connected databases and retrieves information from them as though they were a single database. This way, a distributed database can hold virtually unlimited amounts of data.
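To make the "single database" idea concrete, here is a minimal sketch in Python. The `Node` and `DistributedStore` classes and the node names are hypothetical, and each node is just an in-memory dictionary. The point is only that the caller talks to one entry point while the records live in several places.

```python
import hashlib


class Node:
    """One database in the network, modeled as an in-memory dict."""

    def __init__(self, name):
        self.name = name
        self.records = {}


class DistributedStore:
    """Hypothetical stand-in for a DDBMS: one entry point over many nodes."""

    def __init__(self, nodes):
        self.nodes = nodes

    def _node_for(self, key):
        # Hash the key so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key).records[key] = value

    def get(self, key):
        return self._node_for(key).records.get(key)


store = DistributedStore([Node("us-east"), Node("eu-west"), Node("ap-south")])
store.put("user:1", {"name": "Ada"})
store.put("user:2", {"name": "Grace"})

# The caller never names a node; the store locates the record.
print(store.get("user:1"))
```

The simple modulo placement used here is an assumption made for brevity. Production systems tend to use more careful placement schemes so that adding a node does not move most of the existing keys.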
While a distributed database stores data in multiple connected locations, a centralized or single-node database stores your entire database on only one server.
If you’ve encountered a database like PostgreSQL, MySQL or SQLite, you’ve worked with single-node databases. Many people think of these as “classic” databases, as the single-node model existed well before distributed databases were created.
What are the different types of distributed databases?
There are many ways to organize and configure a distributed database; each comes with its benefits and disadvantages. Depending on your unique situation, one type might work better for your processes.
Here’s a look at the different types of distributed databases.
Homogeneous vs. heterogeneous distributed databases
A homogeneous distributed database is a network of databases in which every site runs the same database management system and uses the same data structures, often on the same operating system. Because the nodes are identical, a homogeneous distributed database is easier to manage than a heterogeneous one.
With heterogeneous distributed databases, the connected nodes can run different software, follow different schemas and store data using different methods. This can cause problems when recalling and analyzing data, because records must first be translated into a common form before they can be parsed correctly, and errors often arise during that translation.
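The translation step in a heterogeneous setup can be sketched as a pair of small adapter functions. Everything below is illustrative: the two node layouts, the field names and the common schema are assumptions invented for this example, not the format of any particular product.

```python
from datetime import datetime, timezone

# Two hypothetical nodes that describe a customer in different ways.
node_a_rows = [{"customer_id": 1, "full_name": "Ada Lovelace", "joined": "2021-03-04"}]
node_b_rows = [{"id": "2", "first": "Grace", "last": "Hopper", "signup_ts": 1614816000}]


def from_node_a(row):
    """Node A already matches the common schema closely; rename the fields."""
    return {"id": row["customer_id"], "name": row["full_name"], "joined": row["joined"]}


def from_node_b(row):
    """Node B stores string ids, split names and Unix timestamps."""
    joined = datetime.fromtimestamp(row["signup_ts"], tz=timezone.utc).date()
    return {
        "id": int(row["id"]),
        "name": f"{row['first']} {row['last']}",
        "joined": joined.isoformat(),
    }


# After translation, records from both nodes can be analyzed together.
customers = [from_node_a(r) for r in node_a_rows] + [from_node_b(r) for r in node_b_rows]
print(customers)
```

Each adapter is a place where translation can go wrong, for example through a time zone assumption or an unexpected id format, which is why heterogeneous setups take more care to manage.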
Replication vs. fragmentation
There are two main ways data is stored on distributed databases. The first type is known as data replication. This is when the data is saved as duplicate copies on multiple different nodes.
Replicated databases improve data availability and can speed up recall, because a query can be answered by any node that holds a copy. They commonly work by appointing a “primary” data node and syncing its data to other nodes. If one of your databases goes down or has scheduled downtime for maintenance, you can rely on the other replicated databases to recall the data.
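Replicated databases require constant coordination and syncing to maintain consistency between the primary node and the replicated nodes. Data can be replicated synchronously or asynchronously, which determines when the copies are updated: with synchronous replication, a write is confirmed only after the secondary nodes have it, while with asynchronous replication, the primary confirms the write first and the secondary nodes catch up afterward.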
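One way to picture the difference between synchronous and asynchronous replication is the toy model below. The `Primary` and `Replica` classes are hypothetical and keep everything in memory. The `flush` method stands in for whatever background process a real system would use to ship changes.

```python
from collections import deque


class Replica:
    def __init__(self):
        self.data = {}


class Primary:
    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = deque()  # changes not yet applied to the replicas

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Synchronous: copy to every replica before acknowledging.
            for replica in self.replicas:
                replica.data[key] = value
        else:
            # Asynchronous: acknowledge now, ship the change later.
            self.pending.append((key, value))
        return "ack"

    def flush(self):
        """Apply queued changes to the replicas (the background sync step)."""
        while self.pending:
            key, value = self.pending.popleft()
            for replica in self.replicas:
                replica.data[key] = value


sync_replica = Replica()
Primary([sync_replica], synchronous=True).write("sku-1", 10)

async_replica = Replica()
async_primary = Primary([async_replica], synchronous=False)
async_primary.write("sku-1", 10)
stale_read = async_replica.data.get("sku-1")  # replica has not caught up yet
async_primary.flush()
```

The trade-off shows up in `stale_read`: the asynchronous replica briefly lacks the new value, while the synchronous primary does extra work before it acknowledges each write.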
The second method of data storage is known as fragmentation. There are two different types of fragmentation, horizontal and vertical.
Horizontal vs. vertical fragmentation
With horizontal fragmentation, data is separated by rows: each fragment holds a subset of complete records, and each primary key still refers to just one record. This fragmentation method works well if you often need to grab information relating to one section or branch of data, such as all the records for a single region.
Vertical fragmentation, on the other hand, splits up your data by columns. Each fragment retains a common identifier so that the values in each column can still be matched back to the correct row and no data is lost.
This method of fragmentation works best for organizations that need to access all of the records but not every data point associated with each record.
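The two fragmentation styles can be sketched on a small table held as a list of dictionaries. The sample rows, column names and helper functions are all made up for illustration.

```python
rows = [
    {"id": 1, "region": "EU", "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "region": "US", "name": "Grace", "email": "grace@example.com"},
    {"id": 3, "region": "EU", "name": "Alan", "email": "alan@example.com"},
]


def horizontal_fragments(rows, column):
    """Group whole rows by the value of one column."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical_fragments(rows, key, column_groups):
    """Split columns into groups; every group keeps the key column."""
    return [
        [{key: row[key], **{c: row[c] for c in columns}} for row in rows]
        for columns in column_groups
    ]


def rejoin(fragments, key):
    """Rebuild full rows from vertical fragments by matching on the key."""
    merged = {}
    for fragment in fragments:
        for part in fragment:
            merged.setdefault(part[key], {}).update(part)
    return list(merged.values())


by_region = horizontal_fragments(rows, "region")
names, contacts = vertical_fragments(rows, "id", [["region", "name"], ["email"]])
restored = rejoin([names, contacts], "id")
```

Here `by_region["EU"]` holds complete rows for one region, while `contacts` holds only the `id` and `email` of every row. Because each vertical fragment keeps `id`, `rejoin` can reassemble the original rows.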
What are distributed databases used for and who uses them?
Distributed databases offer critical benefits over centralized databases, especially when it comes to storage capability, scalability and cost-effectiveness.
Why were distributed databases created?
Believe it or not, computers can only get so big.
We’re not talking about physical size (first-generation computers of the 1940s and ’50s could be about 50 feet long and weigh several tons or more) but rather computational and storage size.
At the time of writing, the largest hard disk drive available for consumer purchase is around 20 terabytes, though non-consumer drives can reach 100 terabytes. At those sizes, data recall and computation speed start to suffer seriously. If you want to not just store your data but also use it, then getting a bigger computer won’t solve your data problems.
This is where distributed databases come in. Instead of storing your information on one single computer and making it bigger when you run out of storage space, you can connect a second device to your network of computers and continue storing and computing more data on that machine.
Let’s dig into the benefits that distributed databases offer organizations, as well as some of the drawbacks.
What are the advantages and disadvantages of distributed databases?
As mentioned, computational and storage size is one of the primary benefits of a distributed database. By distributing your database across different servers, you can continue building a web of database storage that can grow to whatever size you require to fit your database needs.
Distributed databases allow you to scale more easily and quickly. Instead of needing to upgrade a single computer once you run out of space, you can retain that machine and simply add another one to the network.
This also has a cost benefit. Single large-storage computers can be cost-prohibitive for many organizations and individuals, especially if you need to constantly upgrade them to hold more information. Adding several smaller computers as needed is a better way to manage your resources.
Distributed databases can also improve reliability and availability. With a replicated distributed database model, you can duplicate information across several different computers. If one goes down or is temporarily unavailable, that information can still be recalled from a different computer connected to the network.
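The fallback behavior described above can be sketched as follows. The `ReplicaNode` class, the `NodeDown` exception and the sample record are hypothetical. A real system would detect outages through timeouts or health checks rather than a simple flag.

```python
class NodeDown(Exception):
    """Raised when a node cannot answer a request."""


class ReplicaNode:
    def __init__(self, name, data, online=True):
        self.name = name
        self.data = dict(data)  # each node holds its own full copy
        self.online = online

    def read(self, key):
        if not self.online:
            raise NodeDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful answer."""
    for replica in replicas:
        try:
            return replica.read(key), replica.name
        except NodeDown:
            continue  # this copy is unavailable; try the next one
    raise NodeDown("no replica available")


catalog = {"title:42": "Example Movie"}
replicas = [
    ReplicaNode("primary", catalog, online=False),  # simulated outage
    ReplicaNode("replica-1", catalog),
    ReplicaNode("replica-2", catalog),
]

value, served_by = read_with_failover(replicas, "title:42")
print(value, "served by", served_by)
```

Even though the first node is offline, the read still succeeds because another node holds the same record.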
Distributed databases can also improve speed and performance. When recalling information from a distributed database, you can split a request into multiple queries that each database runs at the same time, allowing for data recall and computation at high speeds.
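A rough sketch of running queries on several nodes at once is shown below, using Python’s standard `concurrent.futures` module. The node names and order data are invented, and each "node" is only a list in memory. A real deployment would send the sub-queries over the network instead.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fragments of an orders table, one list per node.
fragments = {
    "node-1": [{"order_id": 1, "total": 40}, {"order_id": 2, "total": 15}],
    "node-2": [{"order_id": 3, "total": 25}],
    "node-3": [{"order_id": 4, "total": 60}, {"order_id": 5, "total": 5}],
}


def local_total(node_name):
    """Runs against one node: sum only the rows stored there."""
    return sum(row["total"] for row in fragments[node_name])


def distributed_total():
    # Send the same sub-query to every node at once, then combine the results.
    with ThreadPoolExecutor(max_workers=len(fragments)) as pool:
        partials = list(pool.map(local_total, fragments))
    return sum(partials)


grand_total = distributed_total()
print(grand_total)
```

Each node only scans its own rows, so the slowest node, not the total data volume, tends to set the response time.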
Disadvantages of distributed databases
There are also some drawbacks to distributed databases, including management complexity, communication issues and the difficulty of maintaining “consensus.”
Particularly with heterogeneous distributed databases, managing all of the connected servers can be difficult, because the system has to account for the operational differences between the connected databases.
If one of the connected databases goes down, you may lose confidence in your data, because you can no longer pull from every source. This is especially true with fragmented databases, where the data is split up across several different servers. If an outage occurs, you may not be able to retrieve all the data you need.
Outages can also be a problem for replicated databases, which carry an additional challenge known as reaching consensus. The problem arises when there’s a discrepancy between two or more copies of the data and the database management system has to decide which one is the “most correct.”
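As a simplified illustration of the discrepancy problem, the sketch below resolves a disagreement by majority vote. This is a deliberately naive stand-in, not a full consensus protocol, and the `majority_value` helper is hypothetical.

```python
from collections import Counter


def majority_value(replica_values):
    """Return the value held by a strict majority of replicas, else None."""
    if not replica_values:
        return None
    value, votes = Counter(replica_values).most_common(1)[0]
    return value if votes > len(replica_values) / 2 else None


# Three replicas report their copy of the same record.
agreed = majority_value(["v2", "v2", "v1"])

# Two replicas disagree and neither side has a majority.
split = majority_value(["v1", "v2"])

print(agreed, split)
```

With three replicas, two matching copies are enough to pick a winner. With an even split, the function returns `None`, which mirrors the situation where the system cannot decide which version is the most correct.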
Read chapter 8 of our Databases Demystified series where we discuss consensus issues in more depth.
Who primarily uses distributed databases?
The organizations that use distributed databases tend to be larger ones that operate across many physical locations and need to compute massive amounts of data.
Multimedia organizations are one example. Netflix has to maintain an enormous library of content that’s tagged, categorized and only available in certain countries, which is where a distributed database comes in handy.
Large manufacturing or procurement companies that maintain and coordinate complex supply chains are another common user of distributed databases.
As mentioned above, however, there are applications for any organization that wants to scale its data capabilities quickly and has the capacity to manage a network of databases.
Best practices for distributed databases
Suppose you’re trying to decide whether or not to use a distributed database for your data needs. In that case, there are a few initial questions you should ask yourself about what your organization is hoping to achieve with its database.
Is your data growing quickly? Are you starting to reach the maximum capacity of your current storage systems? Has the speed of your data recall started to slow due to the growing amount of data stored? Are you opening new physical work locations and trying to decide how to connect your data across all of them?
If you answered yes to (m)any of these questions, then a distributed database is most likely the right choice for you. It’s also important to figure out what type of database will work best for you.
How to decide which distributed database type is right for you
Take stock of your current infrastructure and ask yourself a few questions.
Do you already have a few servers that you’re storing information on and just need to introduce a management system to connect them all? Are they all the same type of server or will you be utilizing different operating systems and setups? Or will you be going from one server to many and therefore need to budget for the purchase of new hardware?
How to decide which distributed database solution is right for you
When it comes to choosing which solution you’ll use to manage your distributed databases, you should consider which features are most important to you.
Maybe cost is no concern, but ease of deployment and day-to-day management are paramount. On the other hand, maybe you are willing to spend a bit more time deploying your new solution as long as it allows you to stay on budget.
What to do with your new distributed database?
Once you’ve decided on a solution to manage your distributed databases, you’ll want to make sure you’re getting the most out of your new expanded data storage and computational capabilities.
Now that your data is distributed, it’s all the more important that you’re centralizing your data outputs and optimizing your data pipeline from end to end. Fivetran can help you do all of that and more.
Conclusion
There are many different types of distributed databases, and setting up your data pipeline to make the best possible use of your data is critical to getting the most out of your database. It’s important to evaluate which setup and use case will work best for your company’s unique infrastructure and growth potential.