We can see a modern business data journey as a collection of workflows. Starting with data extraction from different sources, continuing with the loading and transformation of these data, all the way to the creation of reports and other data products, every step along the way can be understood as a separate workflow composed of smaller tasks. Data orchestration is the process of coordinating the execution and monitoring of these workflows. If we restrict our focus to ETL or ELT data pipelines, we can talk about data pipeline orchestration.
Given that processes in a data pipeline have interdependencies, it is necessary to have systems in place to coordinate these dependencies, execute the different tasks in the desired order, detect potential errors and solve them or generate alerts and logs. Without a robust orchestration system in place it is not possible to guarantee the reliability of data:
- Without a well-defined execution order, data is prone to errors, e.g. data can be outdated, transformations can be performed with incomplete data or results can be returned later than required by the business.
- Schema changes upstream can cause undetected errors in subsequent tasks that expect a different input.
- Changes in the data model downstream can cause undetected errors by making input data incompatible.
- The whole data journey can be very inefficient, for example if multiple parallelizable processes are always run strictly sequentially.
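As an illustration of the schema problem described in the list above, here is a minimal sketch of a check that a downstream task could run on its input before processing it. The column names and the `EXPECTED_COLUMNS` mapping are hypothetical, chosen only for this example; real orchestration and transformation tools ship their own, richer validation mechanisms.

```python
# Hypothetical schema that a downstream task expects from its upstream source.
EXPECTED_COLUMNS = {"order_id": int, "amount": float, "currency": str}


def schema_errors(row: dict, expected: dict) -> list[str]:
    """Return a list of human-readable schema problems for one input row."""
    errors = []
    for column, expected_type in expected.items():
        if column not in row:
            errors.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            errors.append(
                f"{column}: expected {expected_type.__name__}, "
                f"got {type(row[column]).__name__}"
            )
    # Columns that appeared upstream without the downstream task knowing about them.
    for column in sorted(row.keys() - expected.keys()):
        errors.append(f"unexpected column: {column}")
    return errors
```

Running such a check at the boundary between two tasks turns a silent incompatibility into an explicit error that can be logged or used to trigger an alert.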
Furthermore, it is vitally important for the business to be able to detect errors quickly, find their root cause, resolve them and resume the data processes without having to restart from scratch. For this to be possible we need a system that is aware of the state of the data at every step and logs this information. These arguments show that data orchestration is a necessary part of robust data journeys and pipelines. We can now turn our attention to the questions of how this can be done and whether specific tools for it are needed in a modern data stack.
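To make the idea of resuming without restarting from scratch concrete, the following is a minimal sketch, not a production design. It assumes tasks are plain Python callables and that a JSON file is an acceptable state store; the function name `run_pipeline` and the state-file layout are assumptions made for this example.

```python
import json
from pathlib import Path
from typing import Callable


def run_pipeline(tasks: list[tuple[str, Callable[[], None]]], state_file: Path) -> list[str]:
    """Run tasks in order, skipping those already recorded as done.

    Returns the names of the tasks executed in this invocation.
    """
    done = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    executed = []
    for name, task in tasks:
        if name in done:
            continue  # finished in an earlier run; do not repeat it
        task()  # an exception here aborts the run but keeps earlier progress
        done.add(name)
        state_file.write_text(json.dumps(sorted(done)))  # persist after every task
        executed.append(name)
    return executed
```

If the second of three tasks fails, fixing the problem and calling `run_pipeline` again with the same state file starts at the second task, because the first one is already recorded as done.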
The old days: Data orchestration in ETL
In the traditional ETL process, transformations usually need to be custom-built by the engineering team, since each data source needs a specific implementation adapted to its data formats and schemas.
Originally, the execution of the processes in charge of the extraction, transformation and loading was performed via what we can consider the simplest orchestration method: sequential scheduling via cron jobs. This method does not scale well, both in terms of development complexity and execution efficiency, and requires considerable engineering time to set up, maintain, and fix errors when they inevitably occur. Learn more about the differences between scheduling and orchestrating here.
Using orchestration tools
Modern orchestration tools can be used within an ETL pipeline to coordinate all the in-house tasks, from data extraction all the way to reporting. This is a considerable improvement over a simple scheduling system, but it comes at the cost of even more intensive use of engineering labour. Engineers are not only responsible for all data extraction, transformation (which requires business understanding) and loading; they also need to build and maintain the dependencies between jobs and the code needed to run the corresponding complex, nonlinear schedules. This makes the inherent problems of ETL systems even more dire. A business needs to have a very clear use case and a clear understanding of the trade-offs involved before committing to this path.
In recent years providers have developed complex workflow managers that capture the relationships and dependencies between tasks via the concept of a DAG (Directed Acyclic Graph). By treating tasks as nodes in a graph and dependencies as edges between these nodes, these systems can execute the tasks more efficiently, for example by running independent tasks in parallel, while also offering better controls for logging, debugging and dealing with failures and retries. Moreover, these new tools work programmatically, leveraging the power of languages like Python, already extensively used in the data landscape, to allow engineers to design complex, robust and dynamic orchestration systems. In summary, these modern tools follow the paradigm of workflows-as-code.
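The DAG idea can be sketched with Python's standard library alone. The example below is an illustration under stated assumptions, not how any particular orchestration tool is implemented: the task names and the `execution_batches` helper are hypothetical, and it relies on `graphlib.TopologicalSorter`, available from Python 3.9. Each batch contains tasks whose dependencies are all satisfied, so the tasks within a batch could run in parallel.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "load_orders": {"extract_orders"},
    "load_customers": {"extract_customers"},
    "transform": {"load_orders", "load_customers"},
    "report": {"transform"},
}


def execution_batches(graph: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into batches; tasks in one batch have no pending dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)  # mark the batch finished to unlock dependants
    return batches
```

For the graph above, `execution_batches(dependencies)` yields four batches: the two extractions, the two loads, then `transform`, then `report`. A strictly sequential schedule would need six steps for the same work.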
Among the first generation of these modern orchestration tools we find Luigi, developed by Spotify, which pioneered the use of Python in this context. Apache Airflow, originally developed at Airbnb, is a more robust system that has lately gained major traction among both start-ups and established companies. It is open source and maintained under the Apache Software Foundation.
A second generation of data workflow managers focuses on following and taking advantage of the data state on top of the usual task management. Among these we find products like Dagster and Prefect, which strive to improve on Airflow’s premises. However, their adoption is still in early phases.
The big players in the cloud space have also developed orchestration tools, e.g. AWS Step Functions, AWS Lambda, AWS Glue and Google Cloud Composer. However, these suffer from a strong vendor lock-in problem. Finally, there are tools originating in the CI/CD space which have been moving towards orchestration, such as Argo and Tekton.
ELT and the future of data orchestration
The modern data stack consists of a collection of tools, each specializing in one step of the data journey and excelling at it. Within their range of action they take care of orchestrating the required internal processes. We can illustrate this with an example covering the full data journey:
- Fivetran internally orchestrates all the workflows required to extract and load data safely, reliably and efficiently. Your clean, normalized data will be in the data warehouse without any need of orchestration on your side. If you need more granular control, you can also use the Fivetran Airflow package to programmatically manage your pipelines.
- For further data transformation and modelling, dbt orchestrates the table creations, updates and testing necessary to create your final data model. Thus, with a simple coordination between the schedules in Fivetran and dbt, you have done the whole ELT without any complex orchestration required on your side.
- Using a modern reporting tool like Looker on top will provide reporting and self-service dashboarding for your business users, again without any orchestration efforts from your side.
Even the minimal scheduling coordination required between your EL and your T can now be eased with Fivetran's dbt packages for data transformation.
It might seem as though in-house orchestration systems have been made totally unnecessary. A considerable number of businesses, especially young companies in growth processes, can completely rely on the tools of a modern data stack to avoid orchestration complexities. However, orchestration tools can still provide important business value for specific use cases.
In-house orchestration use cases
The main use cases for building in-house orchestration systems can be divided into two types:
1. ETL processes are preferable
- If you require maximum control over some part of your data journey, for example because of data security requirements, you might need to build ETL tools in-house, which will come with necessary orchestration tasks. In this case an open source orchestration tool like Airflow can be a great time saver.
- If you have complex in-house sources that you want to export and load on your own, you can have a hybrid system combining SaaS and in-house processes, which can be coordinated by an orchestration tool.
2. Going beyond standard data analytics
- You might have processes requiring custom scripts, be it within your ELT process or when creating data products fed by your data model. For example, apart from dashboards and reports, you might want to send data to one of your products, such as an ML model developed in Python and used on your website that takes as input a reporting table created in your transformation layer. Read more about “reverse ETL” here.
- If you need to integrate your analytics stack with further data products like Spark batch jobs or Kafka streams, an orchestration tool can treat each system as a workflow to be coordinated.
In all of these cases the data orchestration tool would sit on top of the common modern data stack tools, coordinating them as independent workflows and integrating them into bigger data journeys. Nevertheless, the most common ELT processes keep being coordinated by the specific tools themselves, hence no micro orchestration is required within these workflows.
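When an orchestration layer treats each tool as an independent workflow, much of its job reduces to triggering that workflow, retrying on transient failures and logging what happened. The sketch below illustrates this under simplifying assumptions: each workflow is represented by a plain Python callable, and `run_with_retries` is a hypothetical helper written for this example, not part of any specific tool. A real setup would typically also wait between attempts.

```python
import logging
from typing import Callable

logger = logging.getLogger("orchestrator")


def run_with_retries(name: str, workflow: Callable[[], None], max_attempts: int = 3) -> int:
    """Run one workflow, retrying on failure; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            workflow()
            logger.info("%s succeeded on attempt %d", name, attempt)
            return attempt
        except Exception:
            logger.exception("%s failed on attempt %d", name, attempt)
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
    raise ValueError("max_attempts must be at least 1")
```

The callable passed as `workflow` could wrap anything from a custom script to a call that triggers an external tool; the orchestration layer does not need to know what happens inside it.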
As we have seen, orchestration is a fundamental part of modern data processes. However, a modern data stack makes dedicated orchestration tools redundant for plenty of businesses, because each provider takes care of orchestration tasks internally, via the engineering efforts of its own specialised teams. Even when there are valid use cases for them, the role of data orchestration tools is diminished and much easier to handle, given that they have to take care of a much smaller number of workflows with less complex interdependencies. This way engineers are free to work on developing the product, instead of reinventing the data wheel. Additionally, minimising engineering involvement in the data journey gives analysts and decision makers more power and independence to use data to improve business outcomes.