As modern data sets grow larger and applications become more time-sensitive, traditional ETL has had a hard time keeping up. So how can databases adapt to real-time requirements?
Enter change data capture (CDC). With CDC, data is transferred in small, real-time increments rather than bulk loads or batch windows. With this capability, CDC helps you make faster and more accurate decisions with real-time data movement.
In this quick guide, we’ll run through the basics of CDC, its benefits and some common methods for implementing it in relational databases.
What is change data capture (CDC)?
Change data capture (CDC) is commonly used for replication from databases and processing data from various data sources, such as SaaS applications or other systems only accessible through APIs.
There are two main reasons why CDC is rapidly gaining in popularity:
- CDC drives powerful data-driven use cases as a type of data replication. Data replication allows for consolidated access to operational data for real-time analytics, data streaming and machine learning use cases.
- CDC enables data to be transferred in real-time as changes happen. That makes it invaluable for organizations operating critical systems 24/7 with no convenient time for batch windows, especially where the impact on database processing directly hurts the primary business process. Because critical systems contain the most important data to help drive decisions, real-time access to this data is required to be more competitive.
ETL and change data capture
CDC may sound novel, but it’s actually just an improvement of the familiar ETL (Extract, Transform, Load) process used to load data into a data warehouse or data lake.
Let’s walk through and see what CDC looks like throughout each stage of the ETL process.
Extract
During the extract stage, CDC extracts data in real-time (or near-real time) and provides a continuous stream of change data.
Traditionally, this process is performed in batches, where a single database query extracts a large amount of data in bulk. While this certainly gets the job done, it quickly becomes inefficient as source databases are continuously updated.
By having to refresh a replica of source tables every time, the target table may not accurately reflect the current state of the source application. CDC sidesteps this problem by maintaining a data stream in real-time.
Transform
CDC also presents new efficiencies at the transform stage.
Traditionally, entire data sets need to be transformed using ETL tools to match the structure and format of the target table or repository before loading. While this is still true with CDC, it doesn’t attempt to transform large batches of data at once.
Instead, data is continuously loaded as the source changes and then transformed in the target data repository. With the ever-increasing size of modern data, this approach isn’t just more efficient — it’s necessary to keep up.
Load
As you may have gathered in the “transform” section, load and transform occur almost simultaneously with CDC. If anything, load occurs before transform, as most cloud-based target repositories (e.g., data warehouse, data lake, etc.) handle the transformation.
Benefits of change data capture
Making CDC part of your data integrations presents a number of unique benefits — and not just for ETL. Generally speaking, CDC’s real-time operations allow you and your organization to integrate, analyze and use data faster than ever before.
Here are a few ways CDC’s efficiencies can benefit your data integrations.
Real-time operations (i.e., no more bulk loading)
With no more batch windows or bulk loading, ETL occurs in real time and with better communication between data repositories and sources. This also reduces the impact on source extracts, which no longer need to be refreshed all at once.
Reduced impact on system resources
The key principle of CDC is that it transfers data in tiny increments rather than bulk loads. This allows it to work in real time and greatly reduces the impact on system resources that would otherwise be over-consumed during bulk loading.
Faster database migrations with no downtime
CDC’s ability to move data quickly and efficiently also allows for real-time database migrations without any of the downtime. This also allows for real-time analytics, synchronization and other applications across distributed systems.
Synchronization across multiple data systems
As an extension of the above, CDC’s real-time data transfer also allows multiple data systems to stay synchronized regardless of where they’re located. This is an especially invaluable benefit for time-sensitive applications and environments.
Change data capture methods
There are many different methods for implementing change data capture in different types of databases. To keep things straightforward, we’ll focus on relational databases, which are commonly used for operational data processing.
There are multiple types of change data capture that can be used for data processing from a database. These include log-based CDC, trigger-based CDC, CDC based on timestamps and difference-based CDC.
Log-based change data capture
Most databases built for online transaction processing (OLTP) use a transaction log to record changes. In the case of a system or database crash, the transaction log enables loss-less database recovery. Fundamentally, all committed and recoverable changes can be found in the database transaction logs.
A log-based capture mechanism parses the changes from the transaction log, asynchronously from the transactions submitting the changes. Some database technologies provide an API for log-based CDC. Others don't, and in-depth expertise is required to get changes out.
Few vendors, including Fivetran, provide log-based CDC through so-called binary log readers. A binary log reader parses the transaction log directly, with no intermediate API layers that could slow down or limit change data capture.
Binary log readers also have minimal impact on database processing and offer the most flexibility in deployment options. For example, binary log reading may be performed on a standby system or based on the backups of the log.
Some data replication use cases require transactional consistency in line with the data source tables. Log-based CDC allows for this by default because transaction boundaries and commit order are in the transaction log.
Further, log-based CDC allows you to go back in time and retrieve historical changes.
- Good for: Your most mission-critical and busiest databases that cannot afford a slowdown in database processing. These are the systems with the most important data sets that you want to include in analytical use cases. Log-based CDC handles the highest volumes of change data in real time.
- When to consider other types of CDC: Some database technologies or deployments, such as cloud-deployed databases or applications delivered as a hosted solution or appliance, simply don't allow for log-based CDC.
Trigger-based change data capture
Trigger-based CDC is an approach to recording changes as they happen. Every insert, delete and update operation not only performs its respective change, but a trigger fires to record the change in a separate change table.
The trigger-based approach was preferred before log-based solutions were developed. Nowadays, organizations want to avoid the impact of triggers on database processing. However, trigger-based CDC still works well, and even SAP’s native solution for CDC from SAP HANA relies on it.
There are two common implementations for trigger-based capture:
- Record the change in its entirety. Doing so allows for access to the full history of the changes but effectively doubles the amount of data written to the database for the originating transaction. But with all change data in separate tables, retrieving it is simple.
- Record only the identifying key for the row and the type of operation (insert, update or delete). This approach has minimal impact on the originating transaction. However, knowing just the key and type of operation is not enough to reconstruct the entire history of changes. Data extraction often takes up extra resources due to the required join between the change table and the base table to retrieve then-current values.
- Good for: Use cases when transaction and database performance are not critical and your system has adequate processing capacity.
- When to consider other types of CDC: If your database processing cannot afford the overhead of database triggers due to critical data processing requirements or high volumes. Also, off-the-shelf applications may not allow adding triggers. If your use case requires transactional consistency, then trigger-based CDC is not a good option either.
Change data capture based on timestamps (or another low watermark)
Some applications record when every row change is performed in a separate column (e.g., LAST_MODIFIED). Changes can be extracted by simply keeping track of when changes were most recently extracted and filtering on the respective column.
A major disadvantage of this approach is the inability to identify deleted rows. Some technologies naturally store a low watermark in every row. For example PostgreSQL uses XMIN.
If the value monotonically increases, CDC can also use such an alternative low watermark.
- Good for: Applications with a reliable low watermark column on all tables. Timestamp or low watermark-based CDC is an alternative when other options, including log-based or trigger-based CDC, are unavailable.
- When to consider other types of CDC: If you require the processing of deletes. If exact transactional consistency is important, then timestamp or low watermark-based CDC is generally not a good option.
Difference-based change data capture
Difference-based change data capture is identified by a brute force comparison of all the data. To perform a full difference analysis, all data has to be pulled to be compared. The difference is always based on two snapshots of the data.
- Good for: Relatively small data sets when other options are not available.
- When to consider other types of CDC: If your application requires the full history of changes to the data or if your use case requires exact transactional consistency.
Change data capture with Fivetran
Fivetran is a modern data integration solution that is built around the concept of CDC.
Our platform provides 200+ pre-built connectors that enable data teams to effortlessly centralize and transform data from hundreds of SaaS and on-premises data sources into cloud destinations. Fivetran’s connectors are also optimized with CDC technology, ensuring efficient, high-volume data movement for various deployment options.
Ready to experience the benefits of CDC? Sign up today to begin your free trial.