Every organization needs data agility for an array of reasons: accurate data, consistent data, timely data, etc.
Using modern database replication tools, businesses can meet those needs quickly and thoroughly, moving high volumes of data out of their databases with a change data capture (CDC) solution.
Paired with a database and an ingestion tool, a CDC solution moves data quickly and puts less strain on the source database than batched data movement. This makes CDC a strong fit for data teams and analysts who need to get data into the cloud and analyze it quickly, and it supports real-time or near real-time analytics.
There are several types of CDC on the market, and different CDC solutions have different capabilities. When deciding which CDC solution is right for your data environment, here’s what you need to consider.
Type of CDC
There are three types of CDC solutions that can be used to ingest data into your data warehouse:
Most databases built for online transaction processing (OLTP) use a transaction log to record changes. A log-based CDC parses the changes from the transaction log asynchronously from the transactions submitting the changes. A few vendors, including Fivetran, provide log-based CDC through binary log readers. The advantage is that a log reader parses the transaction log directly, with no intermediate API layers that could slow down or limit change data capture.
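As a rough illustration of the idea, the sketch below treats the transaction log as an append-only list of change records and has a separate reader apply them to a replica after the fact. The record layout (`lsn`, `op`, `key`, `row`) and the `apply_changes` helper are assumptions made for this example. Real transaction logs are database-specific binary formats, and this is not a description of how any particular vendor's log reader works.

```python
# Hypothetical, simplified transaction log: one record per committed change,
# ordered by a log sequence number (lsn).
transaction_log = [
    {"lsn": 1, "op": "insert", "key": 101, "row": {"name": "Ada", "plan": "free"}},
    {"lsn": 2, "op": "insert", "key": 102, "row": {"name": "Bo", "plan": "pro"}},
    {"lsn": 3, "op": "update", "key": 101, "row": {"name": "Ada", "plan": "pro"}},
    {"lsn": 4, "op": "delete", "key": 102, "row": None},
]


def apply_changes(log, replica, last_lsn):
    """Apply log records newer than last_lsn to the replica; return the new position."""
    for record in log:
        if record["lsn"] <= last_lsn:
            continue  # already replicated on an earlier pass
        if record["op"] == "delete":
            replica.pop(record["key"], None)
        else:
            # In this sketch, inserts and updates both carry the full new row.
            replica[record["key"]] = dict(record["row"])
        last_lsn = record["lsn"]
    return last_lsn


replica = {}
position = apply_changes(transaction_log, replica, last_lsn=0)
print(position, replica)
```

The reader only ever consumes the log, so the transactions that produced the changes never wait on it. Keeping the last applied `lsn` lets a later pass resume where the previous one stopped.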
A trigger-based CDC records changes as they happen. In addition to performing its own change, every insert, update and delete operation fires a trigger that records the change in a separate change table. The trigger-based approach was preferable before the development of log-based solutions, but trigger-based CDC solutions have a bigger impact on database processing, which most companies want to avoid.
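The trigger mechanism can be tried out with SQLite from Python's standard library. This is a minimal sketch: the `orders` table, the `orders_changes` change table and the trigger names are made up for the example, and a production setup would typically also capture timestamps and old values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT);

-- Separate change table that the triggers write to.
CREATE TABLE orders_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, order_id INTEGER, status TEXT
);

CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('I', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('U', NEW.id, NEW.status);
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (op, order_id, status) VALUES ('D', OLD.id, OLD.status);
END;
""")

# Ordinary application writes; each one also fires a trigger.
conn.execute("INSERT INTO orders (id, status) VALUES (1, 'new')")
conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
conn.execute("DELETE FROM orders WHERE id = 1")

changes = conn.execute(
    "SELECT op, order_id, status FROM orders_changes ORDER BY change_id"
).fetchall()
print(changes)  # [('I', 1, 'new'), ('U', 1, 'shipped'), ('D', 1, 'shipped')]
```

Every write to `orders` now causes a second write to `orders_changes` inside the same transaction, which is the extra processing load described above.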
Difference-based CDC requires a “brute-force” comparison of all the data. To perform a full difference analysis, you must pull in all data before comparing it. The difference is always based on two snapshots of the data. This method is only good for relatively small data sets, and even then, only when other options aren’t available.
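A snapshot comparison can be sketched in a few lines. The example assumes each snapshot fits in memory as a dictionary keyed by primary key; the `diff_snapshots` helper and the sample rows are illustrative only.

```python
def diff_snapshots(old, new):
    """Compare two full snapshots keyed by primary key and return the changes."""
    inserts = {k: new[k] for k in new.keys() - old.keys()}
    deletes = {k: old[k] for k in old.keys() - new.keys()}
    updates = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return inserts, updates, deletes


yesterday = {1: ("Ada", "free"), 2: ("Bo", "pro"), 3: ("Cy", "free")}
today = {1: ("Ada", "pro"), 3: ("Cy", "free"), 4: ("Di", "free")}

inserts, updates, deletes = diff_snapshots(yesterday, today)
print(inserts)  # {4: ('Di', 'free')}
print(updates)  # {1: ('Ada', 'pro')}
print(deletes)  # {2: ('Bo', 'pro')}
```

Both snapshots have to be read in full on every run, and any change that happened and was reverted between the two snapshots is invisible to the comparison.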
Log-based CDC is the best choice when moving data from a database into a cloud destination or when moving data from your mission-critical and busiest databases that cannot afford a slowdown in processing, because log-based CDC is a non-intrusive method that doesn’t slow down source systems. Log-based CDC is also the most accurate method and can handle the highest volume of change data in real time. Finally, it lets you keep track of every change to rows and maintain transactional consistency.
Architecture
The architecture of the CDC solution you select should be able to keep up with the volume of changes you have. Distributed systems allow for more efficient, and hence higher-volume, replication. The reasons that data replication in distributed systems delivers better performance are threefold:
- The CDC component sits close to the source, so data that is not required is never sent across the wire.
- The architecture allows for compression on just the change data, which optimizes the use of bandwidth.
- The architecture optimizes network communication between its components, which reduces sensitivity to high-latency connections.
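The first two points can be sketched as a source-side step that filters out unneeded changes and compresses only what remains before anything crosses the network. The change-record layout, the table names and the `pack_changes` and `unpack_changes` helpers are assumptions for this example; JSON plus zlib stands in for whatever wire format a real product uses.

```python
import json
import zlib


def pack_changes(changes, wanted_tables):
    """Filter changes at the source, then compress only what remains."""
    selected = [c for c in changes if c["table"] in wanted_tables]
    payload = json.dumps(selected, separators=(",", ":")).encode("utf-8")
    return zlib.compress(payload), len(payload)


def unpack_changes(blob):
    """Destination side: decompress and decode the change batch."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical change stream: half of it belongs to a table nobody replicates.
changes = [
    {"table": "orders", "op": "U", "id": i, "status": "shipped"} for i in range(500)
] + [
    {"table": "audit_tmp", "op": "I", "id": i, "note": "scratch"} for i in range(500)
]

blob, raw_size = pack_changes(changes, wanted_tables={"orders"})
received = unpack_changes(blob)
print(len(changes), len(received), raw_size, len(blob))
```

Filtering before compressing means the unneeded rows never leave the source machine, and the compressed batch is what travels over the slower, higher-latency link.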
Data source connections
Many enterprise systems use a myriad of on-premises and cloud-based databases and applications. Therefore, a CDC solution with a broad library of connectors that can automate connecting these data sources to your cloud data warehouse or data lake is important to successfully power your analytics and reporting.
In addition, a capable CDC solution can automate connections to various database technologies, such as an SAP ERP database, and do it much faster than other methods. For example, for Pitney Bowes, Fivetran's log-based change data capture from SAP and Oracle reduced batch loads from 31 hours to under 2 hours.
Managed services and support
Managed services eliminate the time and expense of in-house development and maintenance of multiple source-to-destination data pipelines and should be a key consideration. At a minimum, you should look for a CDC solution that offers global customer support, which is available 24/7/365 with a guaranteed uptime of 99.9 percent.
Security
It’s essential to consider security requirements and needs for any technology solution you may implement. But it’s even more important to consider security when you’re dealing directly with the movement of data, who has access to the data and local regulations about data movement.
In addition to finding a vendor that can offer maximum security and privacy compliance from source to destination, these security features should be part of any CDC solution you use:
- Enterprise-level security
- Data encrypted both in transit and at rest
- Anonymization of personal data
- No data persistence: data is purged as soon as it’s successfully written to the destination
- Transparency and auditability: full control over data access and detailed logging, so you can see how and by whom data is handled
- Granular role-based access control to ensure users are given only the level of access necessary
- Secure sign-on
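As one illustration of protecting personal data in flight, the sketch below replaces selected columns with a keyed hash before a row is written to the destination. Strictly speaking this is pseudonymization rather than full anonymization, and it is only one possible approach. The column names, the key and the `mask_row` helper are hypothetical and do not describe any specific product's feature.

```python
import hashlib
import hmac

SECRET_KEY = b"example-key-rotate-me"  # hypothetical; load from a secret store in practice
PII_COLUMNS = {"email", "phone"}       # hypothetical column names


def mask_row(row):
    """Return a copy of the row with personal-data columns replaced by keyed hashes."""
    masked = {}
    for column, value in row.items():
        if column in PII_COLUMNS and value is not None:
            digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
            masked[column] = digest.hexdigest()
        else:
            masked[column] = value
    return masked


row = {"id": 7, "email": "ada@example.com", "phone": None, "plan": "pro"}
masked = mask_row(row)
print(masked)
```

Because the hash is deterministic for a given key, the masked column can still be used for joins and distinct counts in the destination without exposing the original value.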
For example, when WeWork became a publicly traded company, it required a stronger security posture. Rather than just pulling in data and providing access, it needed to track data movement and user access over time, proving to auditors and regulators that it keeps customer information safe from internal and external threats.
Fivetran has made it easier for WeWork to maintain data governance and compliance when ingesting data from hundreds of cloud-based and on-premises sources into Snowflake. Fivetran’s log-based CDC provides a formalized framework to pull in this data with a historical record of what data has been ingested and when, including any changes to the database and who has access. It’s also easy to show that the dataset is complete, which is important when tracking reservations, inventory and other metrics.
Choose a data replication tool that delivers all of these features
Fivetran is a modern data integration solution that uses log-based CDC to replicate and move data from the source to your destination of choice. Fivetran has a broad library of 350+ connectors built on a distributed architecture that enables data teams to effortlessly centralize and transform data from hundreds of SaaS applications and on-premises data sources into cloud destinations.
In addition, Fivetran offers enterprise-grade security and governance features that provide transparency and auditability while protecting data movement from the source to the destination. Moreover, the entire data integration process is fully managed and supported by Fivetran global customer support, which is available 24/7/365 with a guaranteed uptime of 99.9 percent, eliminating the time and expense of in-house development and maintenance of multiple source-to-destination data pipelines.
With a CDC solution like Fivetran, you can start moving your data within minutes to drive valuable insights and analytics.