At Fivetran, we have seen many organizations succeed by combining our tools for automated data extraction and loading with dbt for complex data transformations. However, minimizing latency to completely land a transformed, up-to-date, and correct dataset within a data warehouse can require intricate timing and a hierarchy of dependencies. These requirements can be met by an external scheduler specifically designed for orchestrating workflows. This blog is the first in a series that will discuss how Airflow can be used for orchestration in the modern data stack. We will define both scheduling and orchestrating, illustrate the differences, and provide a guide for when each should be considered.
Fivetran automates syncs between sources generating data and destinations storing data. Each connection's Sync Frequency defines an interval at which Fivetran begins moving data from a source to a destination, ranging from every five minutes to every 24 hours. Syncs can also be triggered via Fivetran's API to create a fine-grained, programmatic schedule. Once data lands in a warehouse, it can be transformed on a regular basis via dbt jobs that run on a cron-specified schedule.
(Scheduling in Fivetran (left) and dbt (right))
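To make the programmatic option concrete, here is a minimal sketch of triggering a connector sync through Fivetran's REST API using only the Python standard library. The connector ID and credentials are placeholders, and you should confirm the exact endpoint and response shape against Fivetran's API documentation before relying on this.

```python
import base64
import json
import urllib.request

FIVETRAN_API = "https://api.fivetran.com/v1"

def basic_auth(api_key: str, api_secret: str) -> str:
    """Build the Basic auth header value from a Fivetran API key pair."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return f"Basic {token}"

def trigger_sync(connector_id: str, api_key: str, api_secret: str) -> dict:
    """POST to a connector's sync endpoint to start a sync on demand."""
    req = urllib.request.Request(
        f"{FIVETRAN_API}/connectors/{connector_id}/sync",
        data=b"{}",
        method="POST",
        headers={
            "Authorization": basic_auth(api_key, api_secret),
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Calling `trigger_sync("my_connector_id", key, secret)` from your own scheduler is what turns Fivetran's interval-based Sync Frequency into a fully programmatic schedule.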
A number of characteristics define both Fivetran and dbt scheduling, the most valuable of which is simplicity. Scheduling in Fivetran can be performed without code just by moving a slider for each connector. dbt requires at most a few lines of code to specify when transformations should run.
Because of the extract-load-transform (ELT) nature of the modern data stack, dbt transformations should only occur after Fivetran extractions have been loaded into a destination, making scheduling inherently sequential. Fivetran and dbt are scheduled independently, though, so timing when each should begin can be difficult. Spacing the two tasks too far apart introduces latency, while scheduling them so close together that Fivetran and dbt jobs overlap can cause data quality issues. Both situations make it difficult to scale data pipelines and may call for an orchestration solution.
Modern applications are built via a collection of many loosely coupled services that lead to an outcome. The execution of services and the order in which they run to achieve an outcome can be defined as a workflow, and the relationships and dependencies between services that comprise a workflow can quickly become difficult to keep track of as the number of services grows. These workflows can be managed and run by another service, and we collectively refer to these services as orchestration tools. Popular examples in this space include Airflow, Dagster, and Prefect. These tools extend the scheduling described in the previous section in a number of common ways, marking the difference between scheduling and orchestrating.
(An ELT workflow as a directed acyclic graph in Airflow)
All orchestration tools have a few things in common. They are dynamic: since tasks are configured as code, adding tasks to an existing workflow or defining which workflow to execute is easy. This also allows scheduling to extend beyond ELT into downstream tools like Looker (we will take a look at end-to-end orchestration involving Fivetran, dbt, and applications in a future post in this series). The execution of orchestrated workflows can be distributed. Unlike sequential scheduling, orchestration enables the explicit definition of dependencies between tasks, which are maintained and organized in a directed acyclic graph, or DAG. As soon as a task's dependencies are complete and the orchestration tool has computing resources available, the task will run. The completion of one task can trigger one or more other tasks to run asynchronously and in parallel.
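The dependency-driven execution described above can be illustrated with a small, self-contained sketch. This is not Airflow code; it uses Python's standard-library `graphlib` to show how a DAG determines readiness, with hypothetical task names standing in for Fivetran, dbt, and downstream steps. A real orchestrator would fan ready tasks out to workers in parallel rather than looping over them.

```python
from graphlib import TopologicalSorter

# A toy ELT workflow as a DAG: each key depends on the tasks in its value.
# Task names are illustrative placeholders, not an orchestrator's API.
workflow = {
    "fivetran_sync": set(),
    "dbt_run": {"fivetran_sync"},
    "refresh_dashboard": {"dbt_run"},
    "data_quality_checks": {"dbt_run"},
}

def run_dag(dag: dict) -> list:
    """Execute tasks as soon as their dependencies complete."""
    order = []
    ts = TopologicalSorter(dag)
    ts.prepare()
    while ts.is_active():
        # Every task whose dependencies are done is ready at once;
        # refresh_dashboard and data_quality_checks become ready together.
        for task in ts.get_ready():
            order.append(task)
            ts.done(task)
    return order

print(run_dag(workflow))
```

Note that `dbt_run` can never start before `fivetran_sync` finishes, while the two tasks downstream of `dbt_run` have no ordering between them and could run in parallel.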
In the modern data stack, this means a dbt transformation will not start until Fivetran has finished loading data into the warehouse, eliminating the latency and data quality issues that scheduling alone can introduce. These tools are also robust. As with scheduling, functionality is built in to ensure and maintain idempotency, but orchestration tools add a single pane of glass for monitoring the execution, status, and logs of every task in a workflow.
More to come: Fivetran and Airflow
Scheduling ELT happens at the task level and is simple and sequential, but may not scale. Orchestrating ELT occurs at the workflow level and produces a DAG that is dynamic, extendable, distributed, and robust, but not lightweight.
The next post in this series will show how to trigger and manage Fivetran syncs in Airflow via Fivetran’s API.
Are you currently using Airflow to orchestrate your modern data stack? If so, please contact Fivetran’s Developer Relations team at email@example.com.