Fivetran and dbt™ are an essential duo in the modern data stack. dbt is an ever more popular choice for transforming data due to the framework it provides for executing data transformations in reliable and scalable pipelines. And, there are many ways to optimize your dbt deployment like using pre-built dbt models, or employing macros to assist with destination optimization.
As you scale your practices, optimizing your deployments can save costs — which is especially relevant in our pressurized economic environment. To ensure optimal performance, it's essential to follow best practices and leverage optimization techniques specific to your underlying data destination.
Cloud destination optimization
BigQuery, Databricks and Snowflake are three of the largest data destination providers of the modern data stack that exist today. All of these providers are cloud-based, with their own unique features, although many of their features are also shared. While we’re making recommendations, always refer back to your cloud provider’s documentation for the most up to date resources.
Optimizing a dbt deployment in a cloud data provider involves leveraging various features and techniques, some of which are listed below. Many of these features are shared amongst these providers.
Automated resource optimization for dbt query tuning
All three cloud destinations allow you to automatically scale up or down the resources allocated to a workload based on its current usage. You are also afforded the opportunity to scale out your destination, due to concurrency constraints. By enabling these features, you can improve the performance and cost-efficiency of the destination in which your dbt workloads sit.
Automatic table clustering
The automatic table clustering feature helps improve query performance by organizing data in a way that aligns with the access patterns of your queries. By using the dbt 'partition_by' configuration in BigQuery or Databricks, or ‘cluster_by’ in Snowflake, you can reduce the need for scanning irrelevant data, resulting in faster query execution times and improved system performance.
Query Acceleration Services
Query Acceleration Services (QAS) improve the performance of long-running queries, reduce the impact of scan-heavy outliers and scale performance beyond the limits of the warehouse size. By leveraging QAS in your dbt deployment, you can achieve faster query execution and optimize resource utilization. Variations of QAS are available in BigQuery, Databricks and Snowflake.
Resource management & monitoring
Optimizing cloud provider consumption
Implementing aggressive auto-suspend policies for dbt workloads of around 60 seconds is recommended. This is because there are some occasions where there’s less benefit to caching, for example, with incremental models where the result set continues to change. For heavy, steady workloads or when continuous availability is required, auto-suspend is not useful and you may prefer to disable it.
Imposing credit limits via resource monitors
Utilize resource monitors provided by your data provider to impose limits on the number of credits your dbt warehouse consumes. Resource monitors help control resource utilization and prevent unexpected cost escalations. Establish consistent naming conventions for resource monitors to ensure ease of tracking, allocation and reporting.
Maintain naming conventions
Adopt a naming convention for your dbt objects, including models, schemas, tables and columns. Naming conventions should follow an enterprise plan for the domain and align with virtual data provider naming conventions, for example: <domain><team><function>_<base_name> (USA_PRD_DATASCIENCE_ADHOC). Consistent naming conventions make it easier to manage and track objects, improving overall organizational visibility.
dbt destination adapters
The dbt destination adapters allow dbt to interact seamlessly with your cloud provider. By installing and configuring this adapter, you can take advantage of destination specific optimizations and performance enhancements, resulting in better query execution times. BigQuery, Databricks and Snowflake have supported adapters.
The target variable in dbt contains information about your connection to your chosen cloud destination provider. Use this variable, amongst other things, to limit data when working in your development environment, by using conditional logic.
Post-hooks for table optimization
Leverage dbt's post-hook functionality to run custom SQL commands after deploying dbt models. This allows you to perform additional optimizations, impose grants or cleanup tasks that can enhance query performance or maintain data consistency.
Optimize physical layout of tables
Optimize the physical layout of your tables by coalescing small files into larger ones. This can improve the performance of read queries by reducing disk I/O operations. Utilize the ‘OPTIMIZE’ command in any destination to coalesce files and optimize the layout of your tables.
In addition to platform-specific optimizations, there are general best practices for optimizing dbt deployments:
Intentional materializations in dbt refer to the strategies for persisting dbt models in the data destination. Different materialization types offer varying benefits and should be chosen based on the specific use case and requirements. More information about materializations, including the pros and cons, can be found here.
If these materializations do not meet your needs, you can also write your own custom materializations in your project and use them in the same way as materializations that come inbuilt with dbt. By carefully selecting the appropriate materialization type for each dbt model, you can optimize query performance, minimize build times and reduce overall resource consumption.
Packages in dbt are pre-built modules that contain reusable transformations and logic. Leverage existing packages or create your own to enhance productivity, maintain code consistency and reduce redundancy across projects. My top recommendations are as follows:
dbt_utils (efficiency): It’s impossible to avoid using this package since it’s the most essential package and is required for many other dbt packages. This package includes macros that apply to various dbt projects and can be utilized repeatedly. It’s the fundamental package that shouldn’t be overlooked. It has multiple functions like equal_row_count, at_least_one, not_empty_string, and so on. It also has SQL generators that make SQL coding much simpler, like date_spine, group_by, pivot, unpivot, etc.
dbt_project_evaluator (efficiency): This package highlights areas of a dbt project that are misaligned with dbt Labs' best practices. Specifically, this package tests for modeling (your DAGs), testing, documentation, structure and performance.
dbt-codegen (efficiency): Provides macros that automate the staging layer creation, such as the generate_source macro, which generates YAML code for a source file. Every dbt project on Snowflake should have the dbt_utils package installed, and the dbt_snow_mask and snowflake spend packages are recommended for dynamic data masking and analytics implementation, respectively.
dbt_expectations (code consistency): Inspired by the Great Expectations package for Python. The intent is to allow dbt users to deploy GE-like tests in their data destination directly from dbt, versus having to add another integration with their cloud provider. Tests are available for table shape, missing and unique values, sets & ranges, string matching, aggregate functions, multi-column and distributional functions.
Use Visual Studio Code (VSCode) extensions for dbt to streamline development workflows, improve code quality and enhance efficiency. The Wizard for dbt Core extension is a really useful extension to employ in your dbt deployments across BigQuery and Snowflake. Working with dbt in vscode can be a pain, but this extension really helps in making the partnership better. The most popular features are query preview, SQL to ref conversion, error highlighting, auto completion for columns, tables and refs and function signature help.
Where Fivetran fits in
No matter how you write your dbt models or store your data, Fivetran is an integral part of your pipeline. Our Fivetran Transformations for dbt Core helps you optimize your model runs, further reducing computational costs and data latency.
With our integrated scheduling, we immediately kick off model runs in your destination when new data from an upstream connector finishes syncing. This means that your analytics tables are always up-to-date. And, it means you're only running models when new data is available, reducing errant model run, and thus reducing costs.
With Fivetran, you can also see your model’s entire data lineage and error reporting in one place, improving visibility of your flows. This is more difficult to achieve in a cloud destination platform of your choice.
Finally, with close to 100 pre-built packages and 15 Quickstart packages of our most popular connectors, Fivetran gets you value faster. With Quickstart packages, you can add transformations to your data without even having to write a single line of code or have any Git repositories stood up. And, to ensure value continues to be delivered quickly, making use of our pre-built packages that are already designed to carry out the most popular business logic ensures that that is the case.
Optimizing a dbt deployment is essential for achieving efficient data transformations, reliable data pipelines, and more than anything else — a dependable modern data stack. Together with features of your cloud data provider, dbt and Fivetran, you can ensure that your dbt deployments are optimized for your business.