Change data capture (CDC) is a mechanism for identifying and capturing data changes. In practice, CDC is often used to automatically deliver changes that are detected from a data source (such as a database) to a target (such as a cloud data warehouse).
Why do organizations need change data capture?
Fundamentally, CDC enables organizations to capture data changes sooner, and consequently to make informed, data-driven decisions sooner. This leads to the following practical benefits:
1) Data governance & compliance
Customers change the data they share with organizations all the time. Under regulations such as GDPR, customers have the right to revoke consent, cancel subscriptions and correct their data:
If an individual believes that their personal data is incorrect, incomplete or inaccurate, they have the right to have it rectified or completed without undue delay. If this is the case, you should notify all data recipients if any of the personal data you shared with them has been changed or deleted. If any personal data you shared was incorrect, you may also have to inform anyone who has seen it that this was the case.
A business's ability to capture and report those changes robustly and promptly is critical to staying compliant. Without it, you cannot identify when information was changed, which you must be able to do if your business is audited.
2) Fast & accurate data-driven decisions
Businesses need to make decisions quickly, yet according to the “Speed of Data” survey conducted with Dimensional Research, 75% of businesses still rely on batch processing. Capturing data changes across different data sources automatically and efficiently, and propagating those changes to a target destination, provides a single source of truth. It ensures you are making the right decisions with the right data.
Nucleus Research explored this topic in their study, “Measuring the Half Life of Data”. The analysis of 47 companies and their decision-making tempos shows that data used for tactical decisions (typically incremental choices) is transient and therefore has a half-life of 30 minutes. Operational data isn’t quite as transient, as higher-stakes decisions to change operations typically require analysis to verify value, and so it has a half-life of 8 hours.
The highest-stakes strategic decisions are somewhat less time-sensitive, with a half-life of over two days. The largest data sets, however, can take multiple days to fully sync. CDC obviates the need to fully sync data sets, providing a benefit here as well.
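To make the half-life figures concrete, here is a minimal sketch that assumes "half-life" means simple exponential decay, so the value of data halves with each half-life that passes. That decay model, the function name and the 24-hour batch delay are illustrative assumptions rather than details from the study.

```python
def remaining_value(age_hours: float, half_life_hours: float) -> float:
    """Fraction of the original value left after `age_hours`.

    Assumes simple exponential decay; this model is an illustration,
    not something taken from the study itself.
    """
    return 0.5 ** (age_hours / half_life_hours)


# Half-lives quoted above, in hours (48 is a lower bound for "over two days").
HALF_LIVES = {"tactical": 0.5, "operational": 8.0, "strategic": 48.0}

# Value left if data arrives through a nightly batch and is 24 hours old.
for kind, half_life in HALF_LIVES.items():
    print(kind, remaining_value(24.0, half_life))
```

Under this assumption, operational data that is 24 hours old has passed three half-lives and retains one eighth of its original value.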
The bottom line? Incremental CDC is simply required for fast, accurate business decisions.
Capturing and tracking change is also critical for the accuracy of your machine learning models, as it enables you to calculate every feature you want to use in your model as of each month for which you have history. Fivetran’s history mode automatically records and captures changes to slowly changing dimensions to facilitate accurate machine learning predictions.
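The "as of every month" idea can be sketched as a point-in-time lookup against a history table. The sketch below is a generic slowly-changing-dimension example: the `HistoryRow` type, the `valid_from` and `valid_to` fields and the `value_as_of` helper are hypothetical names chosen for illustration, not the schema that history mode produces.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class HistoryRow:
    customer_id: int
    plan: str
    valid_from: date           # first day this version was in effect
    valid_to: Optional[date]   # exclusive end; None means still current


def value_as_of(history: List[HistoryRow], customer_id: int,
                as_of: date) -> Optional[str]:
    """Return the plan that was in effect for the customer on `as_of`."""
    for row in history:
        if (row.customer_id == customer_id
                and row.valid_from <= as_of
                and (row.valid_to is None or as_of < row.valid_to)):
            return row.plan
    return None  # no version of the record existed on that date


history = [
    HistoryRow(1, "free", date(2023, 1, 10), date(2023, 3, 5)),
    HistoryRow(1, "pro", date(2023, 3, 5), None),
]

# One feature value per month end, each computed as of that date.
month_ends = [date(2023, 1, 31), date(2023, 2, 28), date(2023, 3, 31)]
features = {d: value_as_of(history, 1, d) for d in month_ends}
```

Because every past version is kept, the February feature reflects what was true in February rather than the latest value, which avoids leaking future information into training data.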
CDC also provides many operational efficiency wins, such as:
- Once data resides in your target destination, analytics teams can perform analysis without impacting your production data sources.
- Only data changes are delivered to the target, saving a huge amount of time and resources.
- CDC captures and delivers data incrementally, which means we can stop thinking about data migrations as long projects with one-off transfers and instead treat them as a byproduct of change data capture.
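As a rough illustration of delivering only the changes, the sketch below applies a short list of change events to a target that is modelled as a dictionary keyed by primary key. The event shape `(operation, key, row)` and the `apply_changes` function are assumptions made for this example, not a description of any particular product's wire format.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]
Change = Tuple[str, int, Optional[Row]]  # (operation, primary key, new row)


def apply_changes(target: Dict[int, Row], changes: List[Change]) -> Dict[int, Row]:
    """Apply insert, update and delete events to the target in order."""
    for op, key, row in changes:
        if op == "delete":
            target.pop(key, None)   # tolerate a delete for a missing key
        elif op in ("insert", "update"):
            target[key] = row       # upsert: same handling for both
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# The target already holds the result of the initial sync.
target = {1: {"name": "Ada"}, 2: {"name": "Grace"}, 3: {"name": "Edsger"}}

# Only three events are shipped, however large the source table is.
changes = [
    ("update", 2, {"name": "Grace H."}),
    ("insert", 4, {"name": "Barbara"}),
    ("delete", 3, None),
]
apply_changes(target, changes)
```

The work done is proportional to the number of changes rather than to the size of the table, which is where the time and resource savings come from.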
3) Change management & risk assessment
For many organizations, financial institutions in particular, capturing changes in customer behaviour serves more than data integrity and analytics purposes: it also triggers a number of business processes. For example, capturing a change to a customer’s address triggers a rerun of the risk model to assess whether that change affects the customer’s holistic risk profile. Identifying and investigating changes to a customer’s risk profile is critical for financial institutions to remain compliant with financial crime and AML regulations.
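The address example can be sketched as a small dispatcher that inspects each captured change and calls a handler when a watched field differs. The event shape with `before` and `after` images, the `dispatch_changes` function and the handler are all hypothetical, and the risk model itself is stood in for by a list that records which customers would be re-scored.

```python
from typing import Callable, Dict, List


def dispatch_changes(changes: List[Dict], watched_fields: List[str],
                     on_change: Callable[[int], None]) -> List[int]:
    """Call `on_change` once per customer whose watched fields changed."""
    triggered: List[int] = []
    for change in changes:
        before, after = change["before"], change["after"]
        if any(before.get(f) != after.get(f) for f in watched_fields):
            customer_id = change["customer_id"]
            if customer_id not in triggered:   # avoid duplicate reruns
                triggered.append(customer_id)
                on_change(customer_id)
    return triggered


reruns: List[int] = []  # stands in for a queue of risk model reruns

changes = [
    {"customer_id": 7,
     "before": {"address": "1 Old Rd", "phone": "555-0100"},
     "after": {"address": "9 New St", "phone": "555-0100"}},
    {"customer_id": 8,
     "before": {"address": "2 Elm St", "phone": "555-0101"},
     "after": {"address": "2 Elm St", "phone": "555-0199"}},
]
dispatch_changes(changes, ["address"], reruns.append)
```

Here only customer 7 is queued, because customer 8 changed a field that is not being watched.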
How Fivetran can capture change across multiple data sources
Fivetran offers CDC as a feature for most of our connectors to applications and all of our connectors to databases. For our database connectors, we primarily use log-based replication, although we have also recently acquired a new technology for database replication that offers the completeness of snapshots while approaching the speed of log-based systems.
After the initial sync of your historical data, Fivetran moves to performing incremental updates of any new or modified data from your source database. During incremental syncs, we use your database’s native change capture mechanism to request only the data that has changed since our last sync, including deletes. Each database uses a different change capture mechanism.
During incremental syncs, Fivetran maintains an internal set of progress cursors, which allow us to track the exact point where our last successful sync left off. We record the time of the last sync for each row in a column called _fivetran_synced (UTC TIMESTAMP). This provides a seamless hand-off between syncs, ensuring that we do not miss any data.
Because of our progress cursors, Fivetran's system is extremely tolerant of service interruptions. If there is an interruption in your service (such as your destination going down), Fivetran will automatically resume syncing exactly where it left off once the issue is resolved, even days or weeks later, as long as the log data is still present.
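The general idea behind a progress cursor can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Fivetran's implementation: the change log is a plain list, the cursor is an index into it, and `Destination` and `sync` are hypothetical names. The key point is that the cursor only advances after the destination accepts the batch.

```python
from typing import List


class Destination:
    """Toy destination that can be switched off to simulate an outage."""

    def __init__(self) -> None:
        self.rows: List[str] = []
        self.available = True

    def write(self, batch: List[str]) -> None:
        if not self.available:
            raise ConnectionError("destination is down")
        self.rows.extend(batch)


def sync(log: List[str], cursor: int, destination: Destination) -> int:
    """Deliver log entries from `cursor` onward and return the new cursor."""
    batch = log[cursor:]
    destination.write(batch)      # raises on failure, so cursor is unchanged
    return cursor + len(batch)    # advance only after a successful write


log = ["insert 1", "update 1"]
dest = Destination()
cursor = sync(log, 0, dest)       # first incremental sync succeeds

dest.available = False            # outage begins
log.append("delete 1")            # the source keeps changing meanwhile
try:
    cursor = sync(log, cursor, dest)
except ConnectionError:
    pass                          # cursor still points at the unsent entry

dest.available = True             # outage ends
cursor = sync(log, cursor, dest)  # resumes exactly where it left off
```

The sketch also shows why log retention matters: resuming only works if the entries after the cursor are still in the log when the destination comes back.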
You can track deletions to get a view of your currently active records, or, as Fivetran is developing right now, track full histories for tables.
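One common way to track deletions is to keep deleted rows in the destination and mark them with a flag, so that the active view is just a filter. The sketch below assumes a hypothetical boolean field named `is_deleted`; the field name and the `active_records` helper are chosen for illustration only.

```python
from typing import Dict, List

rows: List[Dict] = [
    {"id": 1, "email": "a@example.com", "is_deleted": False},
    {"id": 2, "email": "b@example.com", "is_deleted": True},   # removed at source
    {"id": 3, "email": "c@example.com", "is_deleted": False},
]


def active_records(rows: List[Dict]) -> List[Dict]:
    """Return only the rows that still exist in the source."""
    return [row for row in rows if not row["is_deleted"]]


active_ids = [row["id"] for row in active_records(rows)]
```

Keeping the flagged rows instead of removing them means the destination can still answer questions about records that no longer exist in the source.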
See how well our approach to CDC works firsthand by signing up for a 14-day free trial and testing our system yourself.