What is idempotence?

Idempotence is an important characteristic of many systems, including Fivetran. Here's what it means.

Idempotence means that if you execute an operation multiple times, the result will not change after the initial execution. For example, if the operation is a mathematical function, it means that the output does not change when applied again to its own output:

f(f(x)) = f(x)

Most functions you might think of are not idempotent. If you repeatedly compute f(x) = x + 1, for instance, the result will increase by “1” with every iteration.

Given x = 1

f(x) = 1 + 1 = 2

f(f(x)) = 3

f(f(f(x))) = 4

By contrast, the function f(x) = abs(x) is idempotent.

Given x = -11

f(x) = 11

f(f(x)) = 11

f(f(f(x))) = 11

A real-world example of idempotence is a button for a floor in an elevator. An initial press of the button will cause the elevator to move its passengers toward the specified floor. Subsequent pushes will not change the elevator’s behavior.

What does idempotence mean in the context of data integration?

Suppose we have a data warehouse with the following records in a table:

id name role
1 Andrew analyst
2 Brenda marketer

Let’s say the data source adds the following record and we want to sync it with our data warehouse

id name role
3 Cary product

Our desired output looks like this:

id name role
1 Andrew analyst
2 Brenda marketer
3 Cary product

Specifically, we want to add the record with id 3.

A good practice for syncing data is to update records incrementally, so that we don’t have to sync an entire data set every time there is an update. This means that we must perform the following actions to update our data warehouse:

  1. Look up the previous saved progress, called a cursor, that tells us we have previously synced up to row 2
  2. Based on this, we know we can start after row 2, so the pipeline only syncs the row with id 3
  3. Record a new cursor pointing to id 3 so that the next sync starts after id 3.

Suppose the sync fails after saving the row with id 3 but before recording the state of the data warehouse into a new cursor. This means the pipeline will sync row 3 again on the next attempt. If our data pipeline is idempotent, the subsequent attempt will still produce the desired output with three rows.

Suppose the system is not idempotent. The data pipeline failed before recording the cursor, so we must try again. But because the system is not idempotent, we end up with the following output:

id name role
1 Andrew analyst
2 Brenda marketer
3 Cary product
3 Cary product

We have duplicated the row with id 3. This extraneous record will throw off any queries, analyses or operations that depend on the data. In a real-world data warehouse with billions of rows, the problem scales accordingly.

What happens when you don’t have idempotence?

In the context of data integration, idempotence makes your data pipeline self-correcting. Most importantly, idempotence prevents duplicate records from being loaded. Duplication poses the following dangers to business intelligence and analytics:

  1. Misleading or erroneous insights – e.g. you will not have correct rank orders, counts, or sums of records. An unbalanced duplication will alter medians, averages, distributions, and other metrics.
  2. Broken operations downstream – e.g. any systems that rely on one-to-one correspondence between records and identifiers may throw an error or produce the wrong behavior.
  3. Wasted storage space and computational bandwidth – you just have extra records that don’t add value yet still have to be accounted for in queries (and disk space).

To the extent that a company relies on data to make high-stakes decisions, duplication can lead to the wrong decisions, to say nothing of the additional overhead costs of wasted storage space and rewriting queries to compensate for duplicate records.

We will further discuss the importance of idempotence for data integration in a subsequent piece. Stay tuned!

Kostenlos starten

Schließen auch Sie sich den Tausenden von Unternehmen an, die ihre Daten mithilfe von Fivetran zentralisieren und transformieren.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Data insights
Data insights

What is idempotence?

What is idempotence?

August 31, 2020
August 31, 2020
What is idempotence?
Idempotence is an important characteristic of many systems, including Fivetran. Here's what it means.

Idempotence means that if you execute an operation multiple times, the result will not change after the initial execution. For example, if the operation is a mathematical function, it means that the output does not change when applied again to its own output:

f(f(x)) = f(x)

Most functions you might think of are not idempotent. If you repeatedly compute f(x) = x + 1, for instance, the result will increase by “1” with every iteration.

Given x = 1

f(x) = 1 + 1 = 2

f(f(x)) = 3

f(f(f(x))) = 4

By contrast, the function f(x) = abs(x) is idempotent.

Given x = -11

f(x) = 11

f(f(x)) = 11

f(f(f(x))) = 11

A real-world example of idempotence is a button for a floor in an elevator. An initial press of the button will cause the elevator to move its passengers toward the specified floor. Subsequent pushes will not change the elevator’s behavior.

What does idempotence mean in the context of data integration?

Suppose we have a data warehouse with the following records in a table:

id name role
1 Andrew analyst
2 Brenda marketer

Let’s say the data source adds the following record and we want to sync it with our data warehouse

id name role
3 Cary product

Our desired output looks like this:

id name role
1 Andrew analyst
2 Brenda marketer
3 Cary product

Specifically, we want to add the record with id 3.

A good practice for syncing data is to update records incrementally, so that we don’t have to sync an entire data set every time there is an update. This means that we must perform the following actions to update our data warehouse:

  1. Look up the previous saved progress, called a cursor, that tells us we have previously synced up to row 2
  2. Based on this, we know we can start after row 2, so the pipeline only syncs the row with id 3
  3. Record a new cursor pointing to id 3 so that the next sync starts after id 3.

Suppose the sync fails after saving the row with id 3 but before recording the state of the data warehouse into a new cursor. This means the pipeline will sync row 3 again on the next attempt. If our data pipeline is idempotent, the subsequent attempt will still produce the desired output with three rows.

Suppose the system is not idempotent. The data pipeline failed before recording the cursor, so we must try again. But because the system is not idempotent, we end up with the following output:

id name role
1 Andrew analyst
2 Brenda marketer
3 Cary product
3 Cary product

We have duplicated the row with id 3. This extraneous record will throw off any queries, analyses or operations that depend on the data. In a real-world data warehouse with billions of rows, the problem scales accordingly.

What happens when you don’t have idempotence?

In the context of data integration, idempotence makes your data pipeline self-correcting. Most importantly, idempotence prevents duplicate records from being loaded. Duplication poses the following dangers to business intelligence and analytics:

  1. Misleading or erroneous insights – e.g. you will not have correct rank orders, counts, or sums of records. An unbalanced duplication will alter medians, averages, distributions, and other metrics.
  2. Broken operations downstream – e.g. any systems that rely on one-to-one correspondence between records and identifiers may throw an error or produce the wrong behavior.
  3. Wasted storage space and computational bandwidth – you just have extra records that don’t add value yet still have to be accounted for in queries (and disk space).

To the extent that a company relies on data to make high-stakes decisions, duplication can lead to the wrong decisions, to say nothing of the additional overhead costs of wasted storage space and rewriting queries to compensate for duplicate records.

We will further discuss the importance of idempotence for data integration in a subsequent piece. Stay tuned!

Verwandte Beiträge

No items found.
No items found.
Fivetran Product Update: October 2024
Blog

Fivetran Product Update: October 2024

Beitrag lesen
Data governance is a top priority for AI success
Blog

Data governance is a top priority for AI success

Beitrag lesen
Build vs. buy for AI: Choosing the right data foundation
Blog

Build vs. buy for AI: Choosing the right data foundation

Beitrag lesen

Kostenlos starten

Schließen auch Sie sich den Tausenden von Unternehmen an, die ihre Daten mithilfe von Fivetran zentralisieren und transformieren.

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.