Benchmarked: A data pipeline latency analysis

Learn how Fivetran measures the latency of our pipelines.
January 29, 2025

This is the second post in the series; see part 1 where we report on historical sync performance.

Introduction

In part 1 of our series on data pipeline performance, we examined historical sync benchmarks. Here, we focus on ongoing incremental sync performance—often referred to as change data capture—and show how Fivetran delivers near real-time data under high-volume database workloads.

To accomplish that, our system is configured with a sync period; on that schedule, we regularly check the source system for changes.

These sets of changes must be integrated into the destination system, where the rows captured during change data capture are inserted, updated, or deleted. Because columnar data warehouses must scan a significant portion of a table to look up or alter existing data, we apply these updates in micro-batches to reduce cost and processing time.
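To make the micro-batch idea concrete, here is a minimal sketch of applying one batch of captured changes with a single set-based MERGE rather than row-by-row DML. The table and column names (orders, orders_staging, _deleted) and the generic DB-API cursor are illustrative assumptions, not Fivetran's actual implementation.

```python
# Hypothetical sketch: apply one micro-batch of captured changes with a
# single MERGE. One set-based statement scans the target table once per
# batch, which is far cheaper on a columnar store than per-row updates.
# Table and column names are illustrative only.

MERGE_SQL = """
MERGE INTO orders AS t
USING orders_staging AS s                -- one batch of captured changes
    ON t.order_id = s.order_id
WHEN MATCHED AND s._deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET status = s.status, amount = s.amount
WHEN NOT MATCHED AND NOT s._deleted THEN
    INSERT (order_id, status, amount) VALUES (s.order_id, s.status, s.amount)
"""

def apply_micro_batch(cursor, staged_rows):
    """Stage one batch of change rows, merge them once, then reset."""
    cursor.executemany(
        "INSERT INTO orders_staging (order_id, status, amount, _deleted) "
        "VALUES (%s, %s, %s, %s)",
        staged_rows,
    )
    cursor.execute(MERGE_SQL)
    cursor.execute("TRUNCATE TABLE orders_staging")
```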

Benchmark design

There are two ways to measure the performance of this system. Most database administrators necessarily track the accumulation of change data – the redo log or write-ahead log – generated by their relational database. To keep the data warehouse up to date with those changes, Fivetran must read and process them, applying each insert, update, or delete to the destination system. So one way to think about performance in this context is in terms of volume and throughput.

We also care about how “fresh” the data is. Cost optimization sometimes means we let the destination copy lag behind the source system, but modern businesses rely on up-to-date information to drive decisions. For this benchmark, we measure latency: “how up to date can we keep the data?”

The combination of these metrics gives us latency at a given throughput. A high-performing system should have latency that is low and stable, while keeping up with the throughput of changes created by the source system. 

Simulation

A Fivetran connection graphed over time looks something like this: 

To simulate this, we designed a benchmark that runs a data pipeline loading data from common OLTP relational databases such as Oracle, PostgreSQL, SQL Server, and MySQL. The first step was to pick a workload simulator; we selected TPROC-C [1] — a variant of TPC-C as implemented by HammerDB — because it is widely supported on the database systems we want to benchmark and showcases our ability to ingest data from a heavily loaded relational database.

TPROC-C generates significantly less data than is typical for our customer base, so we augmented the workload with additional detail columns to bring the average row width up to about 1 KB. This is roughly the same approach used by GigaOm in their benchmark of Fivetran and Qlik. Note that the change log format used by the RDBMS is significantly less compact than the “logical” size of the row.
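To give a sense of that augmentation step, here is a minimal sketch of widening the schema with filler detail columns. The base row width, column count, and names are assumptions for illustration, not the exact schema we used.

```python
import random
import string

TARGET_ROW_BYTES = 1024  # desired average "logical" row width (~1 KB)
BASE_ROW_BYTES = 200     # assumed average width of a stock TPROC-C row
NUM_DETAIL_COLS = 4      # hypothetical number of added detail columns

def detail_column_ddl(table: str) -> list[str]:
    """DDL statements that widen a table with fixed-width filler columns."""
    pad = (TARGET_ROW_BYTES - BASE_ROW_BYTES) // NUM_DETAIL_COLS
    return [
        f"ALTER TABLE {table} ADD detail_{i} VARCHAR({pad})"
        for i in range(NUM_DETAIL_COLS)
    ]

def random_detail(width: int) -> str:
    """Random payload so each change record carries a realistic row size."""
    return "".join(random.choices(string.ascii_letters, k=width))
```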

Data set

Generator: TPROC-C
Volume: 240 GB/hour — measured as log size on disk
Transaction rate: 60M transactions per hour (16.6k transactions per second)
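To put those numbers in perspective: 240 GB/hour of change log works out to roughly 240,000 MB / 3,600 s ≈ 67 MB/s that the pipeline must continuously read and process, and 60M transactions per hour is 60,000,000 / 3,600 ≈ 16,667 per second, the ~16.6k TPS figure above.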

Source system (Oracle, single node)

Instance: GCP E2_HIGHMEM_16
vCPUs: 16
Memory: 128 GB

Fivetran system

Memory limit (JVM): 8 GB

Destination system

Data warehouse size: Snowflake XS

Results

Running this benchmark for about two hours gives us a series of sync cycles. The cycle frequency and batch size are primarily self-governing, based on how long the previous cycle takes to complete.
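That self-governing behavior can be sketched in a few lines; the minimum period and the shape of sync_once are hypothetical, not Fivetran's actual scheduler.

```python
import time

MIN_SYNC_PERIOD_S = 60  # assumed floor on the cycle period; illustrative only

def run_sync_cycles(sync_once):
    """Start the next cycle as soon as the previous one finishes.

    A slow cycle means more changes accumulate, so the next batch is
    larger and the effective period stretches; fast cycles shrink the
    period toward the configured floor. The schedule self-governs.
    """
    while True:
        started = time.monotonic()
        sync_once()  # read newly accumulated changes, merge one micro-batch
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, MIN_SYNC_PERIOD_S - elapsed))
```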

Each bar in this graph shows the start and end of a sync cycle. The TPROC-C workload generates about 240 GB/hour of change data and runs until it is turned off about two hours after the test begins. Given the periodic nature of a micro-batch workload and the fact that our workload generator is constantly making transactional changes to the source system, we can calculate the longest possible delay in data delivery from the gap between consecutive syncs. That latency measurement looks like this:
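Concretely: a change committed just after sync N begins reading is only captured by sync N+1, so the worst-case delay for that pair is the end of cycle N+1 minus the start of cycle N. A minimal sketch of that calculation, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class SyncCycle:
    start: float  # epoch seconds when the cycle began reading changes
    end: float    # epoch seconds when the batch landed in the warehouse

def worst_case_latencies(cycles: list[SyncCycle]) -> list[float]:
    """Upper bound on delivery delay for each pair of consecutive cycles.

    A row committed just after cycles[i] starts is picked up by
    cycles[i + 1] and becomes visible at cycles[i + 1].end at the latest.
    """
    return [
        cycles[i + 1].end - cycles[i].start
        for i in range(len(cycles) - 1)
    ]
```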

From this raw latency data, we can compute industry-standard latency percentiles across each run of our test. In most cases, even on this high-throughput workload, we deliver fresh data into Snowflake in less than 17 minutes.
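Given those per-cycle worst-case latencies, the percentiles fall out of the standard library; the p50/p95/p99 cut points shown here are a common choice we assume for illustration, not necessarily the exact set reported above.

```python
import statistics

def latency_percentiles(latencies_s: list[float]) -> dict[str, float]:
    """p50/p95/p99 of worst-case latency, converted to minutes."""
    qs = statistics.quantiles(latencies_s, n=100)  # 99 percentile cut points
    return {
        "p50_min": qs[49] / 60,   # 50th percentile
        "p95_min": qs[94] / 60,   # 95th percentile
        "p99_min": qs[98] / 60,   # 99th percentile
    }
```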

We expect latency to be sensitive to workload size. Here, we configured a fairly large single-node database workload (~16k transactions per second). In our benchmark we want to exercise the system heavily, but a less intense workload would typically support lower latency than graphed above. We will follow up on that thread in a future post.

Conclusions

We’ve built an automated test suite for measuring realistic incremental sync performance in our system and are looking forward to three things: 

  1. We plan to use this system internally to make sure that data pipeline performance is consistently fast. These tests run every day (including some on pre-production software), so we will quickly become aware of performance regressions and be able to address them before they impact our customers. 
  2. We have extended this benchmark framework to more system configurations beyond Oracle and Snowflake and will be sharing those results shortly. 
  3. We’re excited to share this data publicly and intend to make the results of our ongoing performance measurement publicly available in a live-updating format — similar to https://status.fivetran.com/ 

Overall, this benchmark shows Fivetran’s ability to keep data current in near real-time, even under sustained high throughput. Thanks for reading, and stay tuned for more performance data! 

Experience Fivetran data pipelines for yourself with a free trial.