The following is a guest post by Josh Hall, Full Stack Analytics Engineer at Untitled, a systems integrator that regularly uses Fivetran to solve their clients' problems. Learn more here.
The open-source tool dbt has enabled a sea change in data modeling and in the construction of the modern data stack as a whole. For the first time, dbt puts software engineering principles into the hands of data analysts, analytics engineers, and data engineers, allowing them to rapidly iterate and build robust data models that power data products.
So how does a small company create an avid community of more than 16,000 analysts discussing new ways to use the tool in an open Slack workspace? The underlying premise of dbt is that data modeling code can be reused, which allows the dbt community to create “best-practice transforms” for the most widely adopted data sources. In other words, dbt is creating a library of standardized schemas and reusable code for all data teams to take advantage of.
In this post, we will outline the three major dbt components that can help your data team capitalize on the reusability of dbt: packages, macros, and dependencies.
dbt Packages
The signature feature of dbt is packages. A dbt package is a standalone dbt project that can be added to any other project to extend it. Packages contain macros and models, allowing you to leverage work done by other data professionals within your own project.
Most packages solve a specific problem or use case. For example, if you were using Fivetran to sync data from your Google Ads account to your data warehouse, you could leverage the Google Ads dbt package built by the Fivetran team to model that data, letting you focus on the business logic the data will serve instead of modeling the data yourself.
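To make this concrete, packages are declared in a packages.yml file at the root of your dbt project and installed by running dbt deps. Here is a minimal sketch using Fivetran’s Google Ads package from dbt hub; the version range shown is illustrative, not a recommendation:

```yaml
# packages.yml -- declares the project's package dependencies
packages:
  # Fivetran's Google Ads modeling package; pin a range that matches
  # the package's current release (this range is illustrative)
  - package: fivetran/google_ads
    version: [">=0.9.0", "<1.0.0"]
```

After dbt deps runs, the package’s models and macros are available in your project just as if you had written them yourself.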
Packages are powerful. Once created, a package can be integrated into any dbt project: the work done by other data professionals to solve a specific use case gets packaged up and can be reused by anyone facing the same problem. Reusing others’ code shortens the time to insight by eliminating much of the effort traditionally spent modeling raw source data, increasing your data team’s productivity and allowing for faster deployment of data products.
Packages range from SQL and data warehouse utilities to models for complex data sources. If you are using Fivetran as the ETL tool in your modern data stack, you have access to an impressive number of packages that can be seamlessly integrated into your dbt project. To view all of the packages offered by Fivetran, visit the transformation page of Fivetran’s website. You can also browse all community-supported packages, including Fivetran’s, on dbt hub.
Macros
Macros are another powerful dbt component centered on reusability. While dbt uses SQL for the core components of the tool, macros are written in Jinja, a templating language. By combining SQL and Jinja, data teams can write loops, modularize and repeat business logic, and even vary behavior based on the database target where the dbt project builds its models. Together, SQL and Jinja let a data team apply programming logic that would be difficult to express in SQL alone.
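As a minimal sketch of what a macro looks like, here is a small Jinja template in the style of the examples in dbt’s documentation; the macro name and conversion logic are illustrative:

```sql
-- macros/cents_to_dollars.sql
-- A reusable Jinja template: dbt compiles each call site into plain SQL.
{% macro cents_to_dollars(column_name, precision=2) %}
    round({{ column_name }} / 100.0, {{ precision }})
{% endmacro %}
```

In a model, `{{ cents_to_dollars('amount_cents') }}` compiles to the rounding expression above, so the conversion logic lives in exactly one place.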
Here’s a concrete example: say an analyst is querying a wide table that contains 40 columns. The analyst is preparing the table for a business intelligence tool and wants most of the columns, but not two that contain IDs irrelevant to the business use case. Instead of writing a select statement that lists all 38 desired columns, they can use a macro that lists only the columns to exclude. Not only does this macro save the analyst time and effort once, it can be reused with any table or query result in the future. Reusability at its finest.
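One existing implementation of this pattern is the star macro in the community dbt_utils package, which renders every column of a relation except the ones you list. A sketch, with hypothetical model and column names:

```sql
-- models/customers_for_bi.sql
-- Selects all 38 business-relevant columns without naming them,
-- excluding the two internal ID columns.
select
    {{ dbt_utils.star(from=ref('stg_customers'), except=['internal_id', 'legacy_id']) }}
from {{ ref('stg_customers') }}
```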
When it comes to macros in dbt, there are some trade-offs to consider. First, just because something can be written as a macro doesn’t mean it should be. Weigh how useful a macro is and how likely it is to be reused before investing the time to build it; a macro written for a very narrow use case can easily take longer to create than simply writing the SQL to solve the problem.
Second, too many macros can make your dbt models harder for other users to understand. Because macros abstract away code that would normally be written as plain SQL, a model that confronts readers with several of them can be difficult to follow.
Be pragmatic about the use of macros. If they make work reusable and your code more straightforward, they can be a huge benefit to your team.
Dependencies
Another strength of dbt lies in the ability to create dependencies between models. These dependencies are created through a very simple function called ref (reference). The ref function lets one model depend on the model it references, building a directed acyclic graph (DAG) of your entire data model. This simple functionality enables data teams to create massively scalable, easily managed data models.
Let’s say a company has a customer table that contains all of the dimensional data about its customers, plus two fact tables, one containing sales data and the other containing orders data. Using dbt and the ref function, the data team could create one model for the customer table and reference it from both fact models, each of which pulls in whatever customer information it needs.
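As a sketch, one of those fact models might look like the following; all model and column names here are hypothetical:

```sql
-- models/fct_orders.sql
-- ref() both resolves the table name and records the dependency,
-- so dbt builds dim_customers before this model.
select
    o.order_id,
    o.ordered_at,
    c.customer_name,
    c.customer_region
from {{ ref('stg_orders') }} as o
left join {{ ref('dim_customers') }} as c
    on o.customer_id = c.customer_id
```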
This type of modeling standardizes your models: business logic, naming conventions, and even data type casting can all be housed within dbt models that other models reference. The result is a very clean data modeling process in which models are reused over and over while repeated code is reduced.
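For example, type casting and naming conventions can be applied once in the shared customer model, so every downstream model inherits them. A minimal sketch, assuming a hypothetical crm.customers source declared in the project’s sources file:

```sql
-- models/dim_customers.sql
-- Conventions live here once: keys are cast to text, emails normalized,
-- and columns renamed to the project's standard names.
select
    cast(id as varchar) as customer_id,
    initcap(name) as customer_name,
    lower(trim(email)) as email,
    region as customer_region
from {{ source('crm', 'customers') }}
```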
Untitled + dbt + Fivetran
Untitled leverages dbt to deploy modern data stacks for our clients in days rather than months. With the help of the reusability components and technologies discussed above, Untitled ensures that every company we work with gets the data models it needs to empower decision-making.
With a focus on reusability, Untitled can leverage community-supported dbt packages in conjunction with our homegrown dbt packages, allowing for rapid, unparalleled deployment of data models for our clients. Our packages contain standardized data models for multiple data sources and convenient macros for additional customization.
Fivetran is paramount to Untitled’s ability to rapidly deploy modern data stacks. Through minimal, straightforward configuration, Untitled can create data pipelines and sync data from source to destination in minutes. This lets our team hand off the traditional data engineering work to a robust tool like Fivetran and spend more time on the application of data to each client’s use case instead of building pipelines to obtain it. Additionally, by leveraging Fivetran, Untitled can take advantage of the numerous dbt packages available on dbt hub, something many other ETL tools don’t provide.
If you want to rapidly deploy a modern data stack for your company while also gaining access to industry-leading technologies, check out Untitled. And to tap into best-practice data schemas from a community of 16,000 data analysts, make sure you’re utilizing dbt with Fivetran.