Modern cloud-based applications are in a constant state of change as SaaS providers add new features and build increasingly comprehensive data models. Your typical SaaS API endpoint will periodically update as these changes, ranging from new columns and tables to fundamental reorganizations of the entire schema, are reflected in the data available to the general public.
These updates can completely interrupt a traditional ETL data pipeline. Extraction and transformation scripts that depend on highly specific configurations of data will break upon encountering unexpected elements, meaning that the pipeline never reaches the loading stage. Typically, to accommodate these changes, you will need to completely rebuild the data pipeline. Anyone who has been involved in a data integration project can attest to the engineering hours (and mental anguish) it entails.
From an analyst’s standpoint, these interruptions mean downtime and stale data as syncs are set back by some interval of time. From the standpoint of a data engineer or IT person, this means yet another obstacle to meeting their SLA.
Breaking the cycle
There are two ways to overcome this particular challenge. Both involve ELT rather than ETL. The first, naive approach is to focus on the “EL” part of the process. You can build a system that strictly extracts and loads raw data and is completely agnostic to the underlying data model. This allows you to build a data repository of record that your analysts can revisit whenever the schema changes at the source.
The second way to overcome this challenge is a refinement of the first. The sequence remains ELT, but the entire process is automated and outsourced to a third party. The third party’s role is to develop a fine-grained understanding of the data models underlying every data source in order to develop a normalized schema, and to maintain that data pipeline after it is built.
At Fivetran, we automatically, incrementally, and comprehensively propagate schema changes from the source to your data warehouse. We handle this in a few ways, depending on the situation:
1. If your source adds a new column, we will detect the change and add the same column in our warehouse, backfilling the data if applicable.
2. If a column is removed we won’t delete it outright from your data warehouse but “soft-delete” it and mark future records NULL so that the old data remains as a historical point of reference.
3. If a column’s datatypes change, we will do our best to losslessly accept both the old and new data by retaining the old column and creating a new column with a data type accommodating both the old and new data. You can then create views off that table.
4. If your source adds a new table, we will begin syncing it like any other.
5. If a table is deleted in your source, we will leave it alone in your warehouse.
In short, every change made to the source schema is immediately and losslessly replicated in your data repository of record. We strive to prevent your data pipeline from experiencing any downtime as a result of upstream changes at the source.