In 2012, 6.5 zettabytes of data was created. In 2022, the total amount of data created, captured, copied and consumed globally is estimated to be 97 zettabytes. By 2025, researchers forecast it will go beyond 181 zettabytes. To put things into perspective, each zettabyte is the information equivalent to about a trillion truckloads of paper.
In order to create reports and dashboards that guide decision-making, data must be integrated and modeled in a way data analysts can use. Database schemas help developers define the structure of the database, so data can be organized and efficiently retrieved for analysis and interpretation. Thanks to database schemas, analysts can import data from third-party sources, reconcile the data with their own systems and find the answers to their questions.
What are database schemas?
A database schema is a blueprint that addresses every functional element of a database, such as tables, fields, records and keys and their relationships to each other. Schemas can be designed either for minimal redundancy and maximum interpretability, or for performance and reducing the complexity, time and cost of the queries needed to pull data from a database. Good database schema design can be the difference between a query taking seconds and the same query lasting many hours.
Why design database schemas?
A well-designed schema helps ensure data integrity. You may design a database schema when designing a database from scratch, importing data from a different source into your own application or a data warehouse or reverse-engineering an existing database for better performance. In all of these scenarios, database schemas can ensure consistent formatting and the maintenance of unique primary and foreign keys.
When a schema is well-designed for analytics:
- Analysts don’t have to clean data or preprocess it before analyzing it.
- Analysts don’t have to reverse-engineer the underlying data model.
- Analysts have a clear, easily understood starting point for analytics.
A schema designed for analytics means faster access to data for creating reports and dashboards. The opposite means extra data modeling, the slow retrieval of data and the consumption of more resources (such as time and processing power) when creating reports for analysis.
In the data analytics world, both data sources and data warehouses use schemas to define data elements. But the schemas for data sources — whether they’re databases such as MySQL, PostgreSQL or SQL Server, or SaaS applications such as Salesforce, Facebook Ads, or Zuora — are often designed with operations rather than analytics in mind.
And while data sources such as SaaS apps may provide some general analytics functionality, they cover only the data from that single application. In order to make the best of your analytics, you will need schemas designed for analytics and the ability to connect data between sources. This will provide you with primary and foreign keys to identify the same entities across multiple data sources, enabling you to join data from multiple sources and analyze the combined data for better insights.
How to design a schema for data integration
You need a schema for every data source you import into your data warehouse. The data provider may have organized their data model according to their assumptions about the metrics their users want, which may not be exactly according to your requirements. Follow these steps to design a schema for data integration:
1. Understand the source’s data model
You need to understand the source’s data model to make sense of the data it will send your way. To do that, you can talk to the developers and data engineers who work with the source database and ask them to explain the data model. This will work in situations where you are trying to integrate data from a system within your organization or working with a partner willing to meet with you and discuss the data model.
If you have to integrate data from a SaaS provider, you can get documentation from the vendor that explains the data model. A better option is to integrate through a third party and use their data model; for example, here’s the Fivetran documentation for Salesforce.
When trying to understand the data model of an app, where meeting with developers or acquiring documentation is not an option, you can talk to the users of the application and see how the data flows. Based on what you find, you can start creating entity-relationship diagrams and make sense of how the data flows in the app.
2. Create entity-relationship diagrams (ERD)
Entity-relationship diagrams (ERDs) are visual representations of schemas or entity-relationship models. You can use these to develop a better understanding of how different tables relate to each other. You can also use these to get buy-in from analysts and end users.
There are three types of relationships between entities in different tables.
One-to-one: Two entities can only map onto each other and no other elements.
Think of Social Security numbers — they are unique to each person and will therefore only have a one-to-one relationship with each person. In Fivetran ERDs, if we encounter a nested object that has a one-to-one relationship with the main table, we simply flatten its data into columns.
One-to-many: One entity in a table can relate to many entities in a different table, but the opposite is not true.
Think of ice cream flavors sold by a company and the customers who have ordered products that feature them. A food retailer might use that data to determine a customer’s favorite flavor. That business insight can then be used to proactively recommend new ice creams that feature that flavor as they come to market.
The business value of the one-to-many database is therefore demonstrated: finding insights around commonalities in user behavior and the data it generates and aligning them with actions that drive revenue.
Each flavor can have many customers, but each customer only has a single favorite, which rises above all others. If we encounter a nested object that has a one-to-many relationship to the main table, it is turned into a separate table.
Many-to-many: One entity in a table can relate to many entities in a different table, and the opposite is also true. A many-to-many relationship between entities results in a new table called the join table that has primary keys from both tables.
For instance, a person’s shopping habits might bring them to many stores and each store will have many customers. The join table in the above diagram shows both customer-id and shop-id, which are primary keys in their own tables and present a clear relationship between customers and shops in the join table.
3. Normalize the data
Data normalization helps eliminate data redundancy. When normalizing data, you define rules that divide larger tables into smaller tables, link them using relationships and make sure the data appears consistent across all fields and records. There are six stages of data normalization:
- First Normal Form (1NF): You make sure there are no hidden values in any columns and only one value in each field.
- Second Normal Form (2NF): You make sure all non-key columns depend on the primary key.
- Third Normal Form (3NF): Non-primary key columns don’t depend on other non-primary key columns.
- Fourth Normal form (4NF): The table should not have any multi-valued dependency.
- Fifth Normal form (5NF): The table can be decomposed and recomposed again without losing or gaining any data.
- Sixth Normal form (6NF): The relation variables are decomposed into irreducible components and they do not support any nontrivial dependencies.
6NF is the final level of normalization. You can stop at any stage of data normalization, depending on the use case and your requirements.
After normalization, you can:
- Preprocess the data into a normalized schema and load it into its destination
- Publish your finalized ERD for your analysts
A dimensional schema is a database schema that consolidates information from across numerous data sources used by an organization in order to create a definitive but streamlined source of truth that is comprehensive enough to answer a wide range of business questions.
Before creating a dimensional schema, developers need a conceptual model that standardizes the relationship between different data elements such as name, email address, last purchase, etc. Developers collect information from stakeholders about how they handle data at the organization before creating a conceptual data model.
Developers also need a set of rules, called an internal system of record, that resolves conflicts between data sources. This internal system of record helps developers determine which data to capture from which sources and how it is produced and handled by different teams at an organization.
Think of data that’s shared between the marketing team and the sales team at a company. Marketing runs multiple campaigns and asks people to sign up for demos of a product. Then, people who sign up for the demos are passed onto sales as marketing qualified leads (MQLs). The data model that represents these operations should cover:
- The information that needs to be captured: Name, email address, phone number, job title, company name, product, date, time, campaign name
- The place the information will be stored: A separate table for each prospect using the email ID as the key identifier
- The way the information will travel: From the lead capture form to the central database, from the database to Salesforce (or a similar system)
The data model is tested thoroughly by asking it questions that it can’t answer to build resiliency. For example, what if the prospect changes their mind and does not want to participate in the demo anymore? What if they sign up for a free trial or purchase the product before going through a demo? In both these scenarios, there needs to be an additional piece of information, a “maybe” status that captures whether the prospect has to go through the demo, has gone through the demo, canceled the demo and more.
Once the data model is finalized and the developer has a better understanding of the information related to different business processes at an organization, they can move toward building the dimensional schema.
Styles of database schemas
There are six types of database schemas based on end-user interactions with database systems:
1. Star schema
Star schemas make it easy to query large amounts of data, aggregate it into smaller chunks and use it for business intelligence and analytics. They are ideal if you are working with a large collection of raw data and want to organize it, or you are reconstructing a large data mart or data warehouse.
The star schema has a fact table at its center and multiple dimensional tables that connect to it.
The fact table contains all of the metrics of a business process and usually has numerical values. The dimension table contains data elements that are related to dimensions such as location, time, product and people.
For example, the dimension tables for the example we discussed in the previous section, i.e., a prospect signing up for a demo, will contain information about:
- The lead forms used by marketing to generate MQLs
- The campaigns run by marketing
- The date and time of the campaigns
- The contact information of users signing up for demos
The fact table will contain information from the dimension tables for each MQL. The dimension table regarding products will provide the product ID, the contact information table will provide the email ID and so on. Each of these IDs serves as key identifiers or primary keys in dimension tables but will serve as foreign keys for the fact table. To avoid redundancies, there can be only one table for each dimension, so all the information about products and their IDs will live in the “product” dimension table and nowhere else.
The data in the fact table depends on data from the dimension tables for it to make sense, i.e., you can have the prospect’s contact information live in the fact table, but the dimension tables tell you the activity they performed, the product they are interested in and more.
2. Snowflake schema
Snowflake schema helps save disk space because the data is highly structured. Snowflake schema is used for highly complex queries and advanced analytics.
Just like star schema, snowflake schema also has a fact table at its center and multiple dimension tables that connect to it. The only difference between the two is that the dimension tables in a snowflake schema are further joined to normalized tables.
So a dimension table for “date” can be further normalized into tables for day, week, quarter and so on. All of these new tables can connect to the parent dimension table. The dimension table having further normalized tables allows for more complexity in how data is handled by the organization. For example, if a prospect expressed interest in one product but later expressed interest in more products via live chat or email conversations, the product “dimension” will benefit from a “child” dimension table.
3. Relational model
This type of schema is great for object-oriented programming, where data about objects takes precedence over logic and functions.
The relational model of database schema has a separate table for each object and each table can relate to multiple tables. An object, in this case, is a set of variables that defines the state of the object and the behavior of the object.
The above image shows the relational model of the database schema of a hospital. Each table relates to a specific object, i.e., prescription, procedure, room and appointment, are all objects that relate to each other in multiple ways. Each table can relate to multiple tables and there is no central fact table as in star or snowflake schemas.
4. Network model
The network model is used in situations where one data item belongs in many categories. In the network model of a database, there’s a root table that connects to multiple child tables and each child table can connect to multiple other tables.
Each child table can have many parent tables, allowing for many interrelated pieces of data to be retrieved for analysis. For example, if you are designing the database for a sports website, a network model will work well because every athlete will play for multiple teams and different teams will participate in different tournaments. By having tables that relate to multiple tables, you’ll be able to store and retrieve data by athlete name, team name or tournament name. The downside to using this database schema is that you’ll need to store a column for player names separately in each table.
5. Hierarchical model
This type of schema is useful in environments where you need to be able to add and delete information quickly. It is generally used in the telecommunications, healthcare and banking industries. The data at the top of the hierarchy is quickly accessible.
In a hierarchical schema, there’s a root table (or segment) connected to multiple tables. Each table can have multiple child tables, but each child table can only have one parent table.
The only difference between this diagram and the network model above is that the child tables do not have multiple parent tables. If we consider the banking sector, each bank maintains the records for multiple branches and accounts, and each account has an account number, type, balance and customer information. The account balance is more likely to keep changing regularly, and the customer information, such as address and phone number, may change from time to time. Keeping all of this information in a hierarchical structure allows the commonly used information to be retrieved quickly and easily.
6. Flat model
This type of schema is best for smaller, simpler databases. For example, a database management system that contains all of the daily transactions of a small-town bakery can function well on a flat model schema.
In a flat model database, there are multiple tables with multiple records, but the tables do not relate to each other, i.e., they exist in isolation. It can also just be a single file containing only one table with rows of data separated by commas or tabs. Usually, email marketing tools use this type of database, and the whole database or part of it can be downloaded as a CSV (comma-separated values) file.
Using Fivetran to jumpstart your analytics
DIY data integration involves many steps:
- Figuring out source data models
- Creating ERDs
- Designing database schemas
- Normalizing the data in accordance with the schema
- Performing tests
- Scheduling and performing the integrations.
It can take your developers weeks or months to integrate an application or a system.
Fivetran provides you with a fully managed, automated data pipeline. Every SaaS connector features normalized data schemas out of the box that also automatically adapt whenever the data source changes. Once the data is warehoused, we make it quick and easy to transform data into analysis-ready models. Give us a try and see how much easier analytics can be.