Idempotence prevents the creation of duplicate data when data syncs fail. For the sake of efficiency, data warehouses load records by batches rather than individual records. This also means that progress is recorded by the batch rather than individual record. When a data integration sync is interrupted, it is often impossible to pinpoint the precise record that was being processed at the time of failure, which means that in order to resume the sync, the data pipeline must start at the beginning of the most recent batch. This means that some data that has already been processed must be processed again.
Challenges of resuming a broken sync without idempotence
Suppose you have the following four batches of data to be synced to a data warehouse. The first batch has been completed, and the sync (green bar) is partway through the second batch. The cursor (red arrow) that tracks progress is currently set to the end of the first batch.
All records from the first batch (“foo,” “bar,” “baz,” “qux,” and “corge”), and some from the second batch (“grault,” “garply,” and “waldo”), have been loaded:
Then, the sync is interrupted:
When it resumes again, the data pipeline refers to the cursor, recognizing that the first batch has been completed but that the second batch has not. The second batch must restart.
As the second batch completes, the cursor is updated to the end of the second batch.
Records that were previously synced from the second batch (in this case, “grault,” “garply,” and “waldo”) are reintroduced to the destination (red rows). Without idempotence, the records are duplicated:
By contrast, with idempotence, every unique record is properly identified and there is no duplication.
Failures can be “hard” or “soft.” The illustrations above refer to a hard failure, meaning the program quits due to an exception, forcing the system to resume again from a place indicated by the cursor.
A “soft” failure is more pernicious inasmuch as the data sync does not fail but corrupted data is loaded to the destination. Soft failures require human intervention and thorough investigation to identify a cursor from the past to roll back to. Without idempotence, this can mean that both the correct and incorrect values of the corrupted records coexist in the destination at the same time, not to mention duplicates of any correct records within the resynced range.
Where can data integration fail?
Failures can take place at any of the following three stages of the modern data stack:
- Source: A data source unexpectedly becomes unavailable, interrupting the sync.
- Pipeline: Sometimes the pipeline itself hiccups and stops working.
- Destination: Failed queries to a data warehouse can interrupt syncs, as can upgrades and migrations.
The common theme to all of these failures, as illustrated above, is that data values that have already been loaded are reintroduced to the destination, potentially creating duplicate or conflicting values.
Sources can become temporarily or permanently unavailable. Temporary failures can result from network or query failures. Permanent failures can cause failovers, meaning that a resource unexpectedly becomes completely unavailable, causing that resource to switch to a secondary, backup copy of itself. As the pipeline switches to the backup resource, some records that were already fetched may be fetched again.
Pipeline failures are the most common type of failure. In general, pipeline failures are the result of:
- Infrastructure stoppages (i.e. servers going down)
- Wrong or missing credentials
- Resource limitations (i.e. memory leaks)
- Software bugs
Anyone who has operated a data pipeline, particularly a traditional ETL flow, knows how common failures can be in practice. Even simpler and more durable ELT pipelines periodically suffer stoppages.
Destination failures and migrations
Syncs sometimes fail at the destination as a result of failed queries. Queries can fail because of resource constraints – a node may be too busy, or there may be scheduled downtime.
In addition, organizations periodically upgrade or migrate from one destination to another. This process typically involves maintaining the data pipeline running to the old destination so that it remains operational while the new one is being configured. This new destination will be populated with a copy of the data from the old destination, but in order to ensure that no updates are lost, you will need to rewind the cursor such that some of the same records will be reintroduced to the destination.
Idempotence means self-correction
The common theme to each failure described above is that recovery requires rolling back to some previous cursor, reintroducing some number of records to a destination. Without idempotence, this reintroduction will cause records to be duplicated, undermining the integrity of the data repository. Human intervention is usually necessary in order to fix these data integrity issues.
With idempotence and the ability to uniquely identify records, there is no duplication or any of the associated downstream problems. Although source, pipeline, and destination failures are inevitable, an idempotent data pipeline is essentially self-correcting, largely eliminating the need for humans to engage in data integrity forensics.