As you might imagine, good database schema design is essential to making your data tractable. With a solid schema design in place, you can better make sense of your database and build the dashboards, reports and data models that you need.
In this article, we’ll run down the basics of database schema design, some best practices and how Fivetran provides normalized schemas to make life easier for analysts. Read on to learn more.
A database schema is a blueprint representing a data set's tables and relations. In other words, it’s the foundation for interacting with and using your data.
Schema design is especially important when working with raw data “from the source” since raw data feeds from SaaS APIs are seldom organized in a way that’s usable to analysts. At best, data from APIs are organized to accommodate general analytics use cases, but only those predicted by the provider. This might include pre-joined tables and pre-calculated fields (i.e., denormalization).
As a result, it’s almost always better to design your own database schema. While there’s a bit of thought involved, it’s not too difficult if you understand the data model and follow schema design best practices.
Though there are many ways to design a database schema, all good designs share a few common characteristics. At Fivetran, our designs attempt to meet the following principles:
1. A clear understanding of the application’s underlying data model
2. Normalization to create a single source of truth and maintain data integrity
3. Constructing an entity-relationship diagram (ERD) that clearly represents the following types of databases or relationships:
4. Standardization to allow templated analytics products to be built on top of schemas
These principles represent a substantial investment of time, effort and expertise. Before you can start designing, however, it’s important to understand the underlying data model.
The first and most important step to leveraging data from an application is to understand the underlying data model behind it.
Unfortunately, many APIs are poorly designed and documented, and some API endpoints can be both complicated and inconsistent. For example, the columns representing the same data are sometimes named inconsistently between tables.
In any case, data from API endpoints tend to arrive in a non-tabular format like JSON. From these formats, there are several ways to build an understanding of how an app behaves.
Our developers attempt to accomplish this by doing the following:
- Read the documentation
- Use the app themselves
- Speak with the users of the app
- Speak with the developers of the app (and/or the API)
Simple enough, right? Every SaaS app implicitly contains a representation of the world since the world consists of organizations, people, transactions and other common business concepts. Understanding these representations, then, is essential to understanding the data.
Once you understand the data model and what it represents, you can begin to design your database schema in a way that best reflects it. Keeping this general goal in mind is helpful during the planning stage, which is the first of our database design best practices.
The more upfront planning you do, the easier everything will be down the line.
That’s why upfront planning is not only our top best practice but also the first thing you should do when starting the database design process. Get this step right, and all the other best practices — normalization, ERDs, consistent naming, etc. — will quickly fall into place.
Make sure to keep your target users in mind during the planning phase. One effective way to verify your database design is to allow users to generate a few test reports before full-scale deployment. Their feedback will likely be invaluable as you verify (or modify) your design.
A broader discussion of the normal forms is beyond the scope of this article, but, in a nutshell, normalization strives to eliminate duplicate, redundant and derived data values.
At Fivetran, we often define our data integration workflow as ELT (extract, load, transform), but in reality, it’s more like E-t-LT. Fivetran performs a small transformation (t) for all of our app connectors that involves a modest amount of cleaning, massaging and reorganizing before the data is loaded into the data warehouse.
Normalization is only possible after a large, upfront investment of work to understand the underlying data model of the app. The knowledge that is painstakingly gleaned from API documentation, experience using the app, and consultation with the app developers, helps us understand the business logic performed by the app and how it represents real-world concepts. Then, we identify data columns that are redundant, duplicative or derived from other values, disaggregate or eliminate them, and construct an entity-relationship model.
Entity-relationship diagrams (ERDs) are visual representations of schemas or entity-relationship models. They represent the interrelations between tables. For example, the ERD below is what Fivetran builds for the Asana connector.
Note that every table has a primary key denoting the basic entity that the table represents and uniquely identifies each row in that table. Foreign keys are simply primary keys from other tables. They connect the entity represented by the primary key with another entity represented in another table.
Using primary and foreign keys, we can represent the following relationships in most database types:
- One-to-one. Two entities can only map onto each other and no other elements. A real-world example would be Social Security numbers, which only ever map onto one person. In Fivetran ERDs, if we encounter a nested object that has a one-to-one relationship with the main table, we simply flatten its data into columns.
- One-to-many: One entity in a table can correspond with multiple records in another table, but not vice-versa. An example of one-to-many might be a list of favorite colors and the customers who like them. Each color can have many devoted customers, but every customer only has a single favorite. If we encounter a nested object that has a one-to-many relationship to the main table, it is turned into a separate table.
- Many-to-many. These relationships are represented in join tables. The composite primary key in a join table consists of the primary keys of the two related entities. For instance, a person’s shopping habits might bring them to many stores, and each store will have many customers.
The ERD of a normalized schema allows the app’s underlying data model to be represented in a clear and concise manner. All Fivetran ERDs can be found in our documentation.
Data from a normalized schema is not necessarily useful for analytical purposes out-of-the-box, as the data has been arranged on the basis of real-world logical relationships rather than the computed and derived values commonly used for reporting, analytics and training algorithms.
Luckily, while there are as many ways to design a denormalized schema as there are opinions, there are only a handful of plausible normal forms. Standardizing on normalized schemas is a way to create the lowest common denominator that can be later transformed to enable virtually every use case.
Since the KPIs of many business intelligence dashboards, such as sales and marketing funnels, revenue reports, customer attribution, and so forth, are well known and similar across customers, it is possible to transform the data into a denormalized form for such use cases. That is what we at Fivetran are doing with our analytics templates, which are a set of tools to help our analyst users denormalize the Fivetran data with SQL and dbt (data build tool) into a form that meets their specific analytical needs.
One of the best ways to avoid database headaches is to use naming conventions for fields, tables and every other data element.
No matter what you’re naming, names should be clear, consistent and meaningful to your audience. You should strive to make names clear enough that they can be understood at first glance. That may mean avoiding high-level jargon that might confuse the user but would otherwise be accurate.
And when we say consistent, we mean it: Naming conventions should be the same across your entire database. Some basic rules to follow include not using system-reserved labels for table names or column names, as well as avoiding hyphens and punctuation that may cause confusion or application code errors.
Your data is valuable — which means it deserves top-notch security from the start.
Instead of leaving things to chance, follow “security by design” as you plan your schema. Examples of security by design include establishing an encryption method, permissions (read: not everyone gets to be a database administrator), authentication for database access and so on.
While these points may seem obvious, planning your security from the start is one of the most effective ways to avoid hackers and other security threats.
Despite all this planning, your schema isn’t necessarily set in stone. Instead, it’s a living, breathing thing that will see a multitude of changes and updates over its lifetime.
Keeping up with these changes isn’t just good practice — it’s essential for keeping things clear and organized. To do so, you should maintain ongoing documentation that describes current implementations, their changes, and the reasoning behind them. Even things as simple as explaining relationships between different tables can go a long way in making your database usable.
Good schema design requires a substantial investment of time and talent. At the end of the day, however, adopting a standardized solution allows you to leverage the expertise of people who have made themselves intimately familiar with the data you want — so you don’t have to.