Change data capture – known as CDC – is commonly used for replication from databases and for processing data from a variety of data sources, such as SaaS applications, or other systems only accessible through APIs. There are two main reasons why it’s rapidly gaining popularity:
- CDC is a type of data replication that drives powerful data-driven use cases. Data replication allows for consolidated access to operational data for real-time analytics, data streaming and machine learning use cases.
- CDC enables data to be transferred as changes happen. Organizations operate critical systems 24/7 with no convenient time for batch windows, and impact on database processing directly hurts the primary business process. Because critical systems contain the most important data to help drive decisions, real-time access to this data is required to be more competitive.
Change data capture methods
This blog focuses on relational databases, which are commonly used for operational data processing. There are multiple types of change data capture that can be used for data processing from a database.
Most databases built for online transaction processing (OLTP) use a transaction log to record changes. In the case of a system or database crash, the transaction log enables loss-less database recovery. Fundamentally, all committed and recoverable database changes can be found in the transaction log.
Log-based change data capture parses the changes from the transaction log, asynchronously from the transactions submitting the changes. Some database technologies provide an API for log-based CDC. Others don't, and in-depth expertise is required to get changes out.
Few technology vendors, including Fivetran, provide log-based CDC through so-called binary log readers. A binary log reader parses the transaction log directly, with no intermediate API layers that could slow down or limit change data capture. A binary log reader has the lowest possible impact on database processing and is the most flexible in its deployment options. For example, binary log reading may be performed on a standby system or based on the backups of the log.
Some data replication use cases require transactional consistency in line with the data source. Log-based CDC allows for this by default because transaction boundaries and commit order are in the transaction log.
Furthermore, log-based CDC allows you to go back in time and retrieve historical changes.
Good for: Your most mission-critical and busiest databases that cannot afford a slowdown in database processing. These are the systems with the most important data sets that you want to include in analytical use cases. Log-based CDC handles the highest volumes of change data in real-time.
When to consider other types of CDC: Some technologies or deployments such as cloud-deployed databases or applications delivered as a hosted solution or appliance, simply don't allow for log-based CDC.
Trigger-based CDC is an approach to record changes as they happen. Every insert, update and delete operation not only performs its respective change, but a trigger fires to record the change in a separate change table.
The trigger-based approach was preferred before log-based solutions were developed. Nowadays organizations want to avoid the impact of the triggers on database processing. However, trigger-based CDC works well, and even SAP’s native solution for CDC from SAP HANA still relies on it.
The two most common implementations for trigger-based capture:
- Record the change in its entirety. Doing so allows for access to the full history of the changes but effectively doubles the amount of data written to the database for the originating transaction. But with all change data in separate tables, retrieving it is simple.
- Record only the identifying key for the row, and the type of operation (insert, update or delete). This approach has less impact on the originating transaction. However, knowing just the key and type of operation is not enough to reconstruct the entire history of changes. Also, data extraction takes up extra resources due to the required join between the change table and the base table to retrieve then-current values.
Good for: Use cases when transaction performance is not critical, and your system has adequate processing capacity available.
When to consider other types of CDC: If your database processing cannot afford the overhead of triggers due to critical data processing requirements or high volumes. Also, off-the-shelf applications may not allow adding triggers. If your use case requires transactional consistency, then trigger-based CDC is not a good option either.
CDC based on timestamps, or another low watermark
Some applications record when every row change is performed in a separate column (e.g. LAST_MODIFIED). Changes can be extracted by simply keeping track of when changes were most recently extracted and filtering on the respective column. A major disadvantage of this approach is the inability to identify deleted rows.
Some technologies naturally store a low watermark in every row. For example PostgreSQL uses XMIN. Provided that the value monotonically increases, CDC can use such an alternative low watermark as well.
Good for: Applications with a reliable low watermark column on all tables. Timestamp or low watermark-based CDC is an alternative when other options including log-based or trigger-based CDC are not available.
When to consider other types of CDC: If you require the processing of deletes. If exact transactional consistency is important, then timestamp or low watermark-based CDC is generally not a good option.
With difference-based change data capture is identified by a brute force comparison of all the data. To perform a full difference analysis, all data has to be pulled to be compared. The difference is always based on two snapshots of the data.
Good for: Relatively small data sets, when other options are not available.
When to consider other types of CDC: If your application requires the full history of changes to the data, or if your use case requires exact transactional consistency.
Change data capture with Fivetran
Fivetran is a modern data integration solution that is built around the concept of CDC. It provides 180+ pre-built connectors that enables data teams to effortlessly centralize and transform data from hundreds of SaaS and on-prem data sources into cloud destinations. Fivetran’s connectors have been optimized with the right CDC technology to ensure efficient, high volume data movement for various deployment options.
Start your free 14-day trial today and see how Fivetran simplifies and accelerates data movement between your applications.