Guides

Building a data warehouse design: Types, architecture & implementation steps

May 16, 2023

Fivetran

Anchor Link

Fivetran

Topics

data warehouse

In this post, we’ll look at how you can design a data warehouse for your enterprise and see how Fivetran can help you load data in your data warehouse and drill down to any level of detail you want.

Enterprises often have data dispersed across departments and teams. Disparate data doesn’t give business leaders a comprehensive and data-oriented view of how their business is growing and where it needs improvement.

‍

A single source of truth is critical to gain holistic insights. For example, by centralizing all data relating to supply chain management, CRM and customer success metrics in one place, an enterprise can prevent siloed thinking.

Data warehouses are data repositories that consolidate data from disparate sources for reporting and analytics. Data warehouses unify data organization and representation processes so that it’s easy to capture, share and analyze data.

Two approaches to designing a data warehouse

The two most widely discussed methodologies for data warehouse designing are the Inmon and Kimball approaches.

Bill Inmon (Top-down approach)

In the top-down approach, the data warehouse is designed first and then data marts (data structure pertaining to a single business line or team in a company) are created.

Data engineers first extract data from data sources and transfer it to the data warehouse. You can also have a staging area to store data before it’s moved into a data warehouse.

Post this; you can extract data from the data warehouse and apply summarization techniques on extracted data and distribute this data to data marts. The benefit here is that since data marts get all their data from a central data warehouse, data remains consistent as it departs from a central place. However, this method can be time-consuming and expensive.

The Inmon approach is useful if you need enterprise-wide data integration functionality, have a large team of data specialists and don’t have hard time constraints for data analysis. For example, an insurance company may want to adopt the Inmon method because it would give a complete view of their clients, mortality rate, claims history, etc.

Kimball (Bottom-up approach)

In the Kimball approach, data marts are created first based on business requirements and then integrated into a data warehouse. This approach is also called the dimensional data warehouse, where analytical systems can directly access data for analysis.

The benefit of the Kimball approach is you can build data marts quickly and generate reports faster. Plus, you can slice and dice data to get any info you want and evolve your data design architecture as your business needs evolve.

For example, you can create data marts for every department, like sales, marketing, finance, and HR. Each department can generate reports and scale up easily by adding new data marts.

How records are organized in a data warehouse: Schemas

Databases store information in rows and columns. They’re often used to organize data into tables or spreadsheets. However, to use this data, data analysts need to model and define data in a specific way. This is where schemas come in.

A database schema is a collection of rules that define how data is organized within a database. These rules determine how data is stored and retrieved from the database. Schema also lets analysts understand how different elements (dimension tables, fact tables) are related within a database. Also, since the database structure is pre-defined, it makes it easy for analysts to extract data from any source and interpret it to make business decisions.

There are three types of schemas.

1. Star schema

In a star schema, the core component is a single fact table. Around the fact table are dimension tables.

‍

‍

Fact table stores all the primary business information, usually in numerics. And dimension tables hold all the attributes related to the data in fact tables. For example, a fact table called “Sales” can have references to attributes like the product ID, order ID, quantity, price and discount.

A star schema makes it quick and easy to query data for reporting. It is designed to support queries such as finding customers who ordered a specific item.

2. Snowflake

A snowflake schema is useful when enterprises want to query highly complex data for advanced analytics.

‍

‍

Similar to a star schema’s design, snowflake schema also has a fact table in the center and the fact table is surrounded by multiple dimension tables. Here, dimension tables are further related to normalize tables.

So say, a “date” dimension table can have tables for day, week, quarter and month. These new tables can connect to the parent dimension table. This helps to handle more complex queries. For example, if a customer is interested in Product A and later wants to know about Product B via a live chat, the product dimension table will have specific info from the child dimension table.

3. Galaxy

A galaxy schema has multiple fact tables connected and mapped to multiple dimensional tables, making them ideal for organizations with complex data structures and databases. In a galaxy schema, data redundancy is low, which allows the data quality to be more accurate. This helps in powerful analysis and reporting.

Modern data warehouse design architecture

Until recently, only one- and two-tier architectures were prevalent for data warehouses. Presently, three-tier is the most preferred architecture.

As seen in the image above, the bottom tier includes a data repository that collects data from disparate sources. The data repository can save data from multidimensional or relational databases. However, before the data is stored, data is transformed as a part of the Extract, Transform, Load (ETL) process.

Data transformation can include:

Revising data
Data cleansing for formatting compatibility
Format conversions
Key restructuring
Deduplication (finding and removing duplicate data)
Data validation
Removing repeat and empty columns
Summarization
Sorting, ordering and indexing
Standardization

The middle tier consists of an online analytical processing (OLAP) server. OLAP works with two types of models:

Multidimensional online analytical processing (MOLAP): Which involves multiple databases
Relational online analytical processing (ROLAP): Which involves relational databases

The OLAP server is a crucial step for end-users like data analysts as it provides an abstract view of the database and acts as a mediator between the users and the database. The top tier is the user interface of front-end tools and APIs that help you get data out of the data warehouse. These could be any data mining, query, reporting and analysis tools.

Types of data warehouse designs

Based on various business requirements, there are different ways you can design your data warehouses.

On-prem vs. cloud

Historically, data warehouse architecture was mostly done in-house, but that comes with certain limitations:

It’s expensive as organizations have to purchase servers and a space to house the servers.
There are additional costs to hire staff to manage servers, locations, etc.
Because it’s self-hosted, scalability can be an issue.

In contrast, in current times, cloud data warehouses are widely adopted. Fifty-four percent of respondents in a survey shared that cloud data warehousing is a key trend they’d like to follow. That’s because they come with a lot of benefits:

There are no hardware or server costs.
The design is managed by the vendor, who in turn has experts working behind the scenes. This significantly saves hiring costs.
They’re flexible and every enterprise can mold it for their different use cases.
Speeds up access to data and saves time.
You only pay for what you use, which makes it affordable.

Traditional/batch vs. real-time

The traditional design involves loading data from sources in hourly, daily or weekly batches. Batch method was the most used method to load data, but since business users want to see insights instantly, real-time data warehousing design is becoming more prevalent.

In the real-time data warehousing model, data is constantly loaded into the data warehouse and is available for data analysts to create reports and predictions. End users get the most updated information and make faster decisions.

For example, a customer wants to get the most recent information about an online order they placed or a sales manager wants to evaluate trends from fresh sales data — this can happen only if data is sourced in real-time.

Other benefits of a real-time data warehousing model include:

Improves data democratization, where every team member can easily access current and historical data they need to complete their tasks and optimize their efforts.
Creates a base for advanced analytics and machine learning that helps design personalized customer experiences.
Improves the pace at which a business evolves and can respond to changes.

How to design a data warehouse: Step by step guide

Every data warehouse design differs based on various business parameters, use cases and requirements. Here’s a common blueprint you can use and modify.

1. Recognize business objectives and user needs

The first step is to ensure that your data warehouse is compatible with the business processes that you currently adhere to. Another critical fact is to check with your organization's stakeholders about their objectives in using the data warehouse. Then, you must also know about the technical requirements and compliance standards you must follow before the implementation starts.

Finally, answering these questions can simplify the first step:

How many data sources do you need to integrate and what’s the data volume?
What current and future business needs will the data warehouse fulfill?
What questions will the data warehouse help answer or how will the data warehouse architecture solve business problems?

2. Choose a data model

Data models help create documentation for the data warehouse implementation. They also help modelers decide how to structure data, establish relationships between different data points and establish key metrics.

Though data engineers can build enterprise data models, it’s a time-consuming process. Hence, a wiser option is to use a solution like Fivetran that offers pre-built data models. Fivetran normalizes data automatically so that the modeling technique becomes simpler and faster.

3. Select an ELT or ETL solution

You can choose an Extract, Transform, Load (ETL) process or an Extract, Load, Transform (ELT) process for data integration. However, ELT is a better and more flexible solution as it always generates raw and fresh data for analysis. Here’s more information about the differences between ELT and ETL processes. Our solution, Fivetran, has automated data connectors that take just a few minutes to set up and help you efficiently build and manage the ELT data pipelines. Once your models and solutions are decided, you can also define the scope in terms of deliverables, key people for each task, their KPIs, budget and timelines.

4. Build your reporting interface

How do you use the results from data queries to create business intelligence (BI) reports? That’s done by data visualization and reporting tools like Power BI and Tableau. For this step, decide how often you want to create reports and who will be the stakeholders for different types of reports so that you can choose a business intelligence tool that fits your requirements.

5. Rollout and perform assessments

Once all the core steps are implemented, enterprises can ingest data into their ELT/ETL tools, analyze the data and validate results from BI systems.

How Fivetran can help

Fivetran offers a fully managed and automated data pipeline. Every SaaS connector comes with normalized and out-of-the-box data schemas. Whenever data sources change, these schemas automatically adapt so that you get the most updated data.

‍

Fifty-two percent of IT managers want faster analytics processing. Once data reaches the data warehouse, Fivetran makes it quick and easy to transform data into analysis-ready models.

Fivetran lets organizations build pipelines from their data sources without writing a single line of code. Data scientists and engineers then don’t have to worry about building and maintaining pipelines and spend that time on what’s critical to the business.

‍