We have previously established that idempotence is crucial to failure-proofing your data pipeline, and that it depends on the ability to correctly identify unique records. Identifying unique records prevents duplicate or conflicting data from being produced when the same records are sent twice. A primary key is a field or combination of fields within a record that is unique to that record and relatively immutable, which means primary keys can be used to identify unique records. Here, we will discuss the exact mechanisms behind correctly identifying primary keys and using them to ensure extractions without duplications.
Two of the most common types of data sources that organizations extract data from are operational databases and API feeds. To construct idempotent data pipelines from either, we need to correctly identify primary keys. Databases and API endpoints differ somewhat in how they expose primary keys and, more generally, in how updated data is read from them.
Integrating data from databases
In databases, the primary key is generally explicitly declared in the schema, which allows it to be identified programmatically.
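As a rough sketch of what that looks like in practice, the query below reads primary key columns from the PostgreSQL system catalogs. This assumes a PostgreSQL source and the psycopg2 driver; the connection string and table name are placeholders, not part of any particular connector:

```python
# Sketch: discover a table's primary key columns from the PostgreSQL
# system catalogs. Assumes psycopg2 is installed; the DSN and table
# name below are placeholders.
import psycopg2

PK_QUERY = """
    SELECT a.attname
    FROM pg_index i
    JOIN pg_attribute a
      ON a.attrelid = i.indrelid
     AND a.attnum = ANY(i.indkey)
    WHERE i.indrelid = %s::regclass
      AND i.indisprimary;
"""

def primary_key_columns(conn, table: str) -> list[str]:
    with conn.cursor() as cur:
        cur.execute(PK_QUERY, (table,))
        return [row[0] for row in cur.fetchall()]

conn = psycopg2.connect("dbname=example user=example")  # placeholder DSN
print(primary_key_columns(conn, "users"))  # e.g. ['id']
```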
An initial sync from a relational database requires a `SELECT *` query, as it must return all records and all values. By contrast, subsequent syncs should not retrieve entire tables. Instead, a data connector for a database should query the changelog. The challenge posed by changelogs is that they do not contain the original data, only a list of updates. These updates may be whole-row, in which a primary key and all values are tracked, offering a full snapshot of each record, or partial-row, in which only a primary key and the changed values are tracked. Either way, changelog entries with existing primary keys are merged with the corresponding rows in the destination, while entries with new primary keys are added as new rows.
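The merge logic can be sketched roughly as follows, with the destination modeled as a dict keyed by primary key. The record shapes here are hypothetical, and a partial-row entry is assumed to carry only the key plus the changed values:

```python
# Sketch of applying changelog entries to a destination keyed by
# primary key. Entry shapes are hypothetical: a whole-row entry has
# every column; a partial-row entry has the key plus changed values.

def apply_changelog(destination: dict, entries: list[dict],
                    primary_key: str, whole_row: bool) -> None:
    for entry in entries:
        key = entry[primary_key]
        if key not in destination:
            destination[key] = dict(entry)   # new key: add a new row
        elif whole_row:
            destination[key] = dict(entry)   # full snapshot: replace the row
        else:
            destination[key].update(entry)   # partial: merge changed values

dest = {1: {"id": 1, "name": "Alice", "city": "Oslo"}}
apply_changelog(dest, [{"id": 1, "city": "Bergen"}], "id", whole_row=False)
assert dest[1] == {"id": 1, "name": "Alice", "city": "Bergen"}
```

Because the merge is keyed on the primary key, replaying the same changelog entries is a no-op, which is exactly the idempotence property we are after.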
Without primary keys and idempotence, the destination will create a new row for every log entry, duplicating some rows in the corresponding tables.
When primary keys are not explicitly declared, the data pipeline can't programmatically identify them. In that case, the only alternative is to impute a primary key by hashing the contents of the entire row. The problem is that this only prevents exact duplication: because any change to a single value produces a different hash, the pipeline has no way to recognize the update and instead creates a new, conflicting row.
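A minimal sketch of a hash-imputed key, with hypothetical values, shows the failure mode. Re-sending an identical row is deduplicated, but a single changed value yields a different hash and lands as a second, conflicting row:

```python
# Sketch: impute a primary key by hashing the entire row. This blocks
# exact duplicates, but any changed value yields a different hash, so
# an updated record arrives as a conflicting new row.
import hashlib
import json

def row_hash(row: dict) -> str:
    # Serialize with sorted keys so the hash is stable across field order.
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode()).hexdigest()

dest: dict = {}
original = {"name": "Alice", "city": "Oslo"}
updated = {"name": "Alice", "city": "Bergen"}  # same person, new city

dest[row_hash(original)] = original
dest[row_hash(original)] = original   # exact duplicate is deduplicated
dest[row_hash(updated)] = updated     # but the update creates a second row

assert len(dest) == 2  # old and new versions of the same entity coexist
```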
Integrating data from a SaaS API
With an API endpoint, there is typically no programmatic way to identify primary keys; instead, they are found in the documentation. If primary keys are not well documented, the alternative is to construct one by choosing a field or combination of fields that are fixed and unique to a particular entity. This can be highly error-prone, as fields that appear to be fixed and unique, such as email addresses, may not actually be. To determine a usable primary key, you have to understand the exact behavior of the application that uses the data, specifically which fields can and can't be changed.
For instance, a naive and tempting approach is to identify a user with contact information such as username or email address. But don’t take for granted that usernames and emails can’t change. Consider a table with a sample of basic user data:
- Username
- Name
- Phone number
- Address
There are plenty of practical reasons why a person might change any of these (moving, marriage, a new phone plan, etc.). You must use a more-or-less fixed reference point as a unique identifier, which means determining exactly which fields are fixed within that particular system and which are not. In addition, you will need to ensure that any fields you use as primary keys don't (and won't ever) contain null values.
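A hedged sketch of constructing a composite key from fields you have verified to be fixed in the source system follows; the chosen fields, account_id and created_at, are purely illustrative, and the null check guards the assumption that key fields are never empty:

```python
# Sketch: build a surrogate key from fields verified to be fixed and
# unique in the source system. The field choices are illustrative;
# the null check guards the assumption that key fields are never empty.

KEY_FIELDS = ("account_id", "created_at")  # hypothetical fixed fields

def composite_key(record: dict) -> str:
    parts = []
    for field in KEY_FIELDS:
        value = record.get(field)
        if value is None:
            raise ValueError(f"key field {field!r} is null; unusable as a primary key")
        parts.append(str(value))
    return "|".join(parts)

print(composite_key({"account_id": 42, "created_at": "2021-06-01T12:00:00Z"}))
# -> '42|2021-06-01T12:00:00Z'
```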
Now suppose you guess that email is the primary key, but both email and name are mutable at the source. The sketch below, with hypothetical values, shows what goes wrong: when the user changes their email, the updated record no longer matches its earlier key, so a second, conflicting row appears.
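```python
# Sketch: using a mutable field (email) as the primary key. When the
# email changes at the source, the destination cannot match the update
# to the existing row, so the same person appears twice. All values
# here are hypothetical.

dest: dict = {}

before = {"email": "jane@old-example.com", "name": "Jane Smith"}
after = {"email": "jane@new-example.com", "name": "Jane Jones"}  # same person

dest[before["email"]] = before
dest[after["email"]] = after  # keyed on the new email: no merge happens

assert len(dest) == 2  # one person, two conflicting rows
```

This is why a truly fixed identifier, such as a system-generated user ID, makes a far safer primary key than any piece of contact information.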