Benchmarked: A data pipeline latency analysis

Learn how Fivetran measures the latency of our pipelines.
January 29, 2025

This is the second post in the series; see part 1 where we report on historical sync performance.

Introduction

In part 1 of our series on data pipeline performance, we examined historical sync benchmarks. Here, we focus on ongoing incremental sync performance—often referred to as change data capture—and show how Fivetran delivers near real-time data under high-volume database workloads.

To accomplish that, our system is configured with a sync period; on that schedule, we regularly check the source system for changes.

These sets of changes must be integrated into the destination system, where the rows captured during change data capture are inserted, updated, or deleted. Because columnar data warehouses must scan a significant portion of a table to look up or alter existing data, we apply these updates in micro-batches to reduce cost and processing time.
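To make the micro-batch idea concrete, here is a minimal sketch of applying one batch of captured changes with a single set-based MERGE rather than row-by-row DML. The table and column names (orders, orders_staging, _deleted) and the generic DB-API cursor are illustrative assumptions, not Fivetran's actual implementation.

```python
# Hypothetical sketch: apply one micro-batch of captured changes with a
# single MERGE. One set-based statement scans the target table once per
# batch, which is far cheaper on a columnar store than per-row updates.
# Table and column names are illustrative only.

MERGE_SQL = """
MERGE INTO orders AS t
USING orders_staging AS s                -- one batch of captured changes
    ON t.order_id = s.order_id
WHEN MATCHED AND s._deleted THEN DELETE
WHEN MATCHED THEN UPDATE SET status = s.status, amount = s.amount
WHEN NOT MATCHED AND NOT s._deleted THEN
    INSERT (order_id, status, amount) VALUES (s.order_id, s.status, s.amount)
"""

def apply_micro_batch(cursor, staged_rows):
    """Stage one batch of change rows, merge them once, then reset."""
    cursor.executemany(
        "INSERT INTO orders_staging (order_id, status, amount, _deleted) "
        "VALUES (%s, %s, %s, %s)",
        staged_rows,
    )
    cursor.execute(MERGE_SQL)
    cursor.execute("TRUNCATE TABLE orders_staging")
```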

Benchmark design

There are two ways to measure the performance of this system. Most database administrators necessarily track the accumulation of change data – the redo log or write-ahead log – generated by their relational database. To keep the data warehouse up to date with those changes, Fivetran must read and process them, applying each insert, update, or delete to the destination system. So one way to think about performance in this context is in terms of volume and throughput.

We also care about how “fresh” the data is. Cost optimization sometimes means we let the destination copy lag behind the source system, but modern businesses rely on up-to-date information to drive decisions. For this benchmark, we measure latency: “how up to date can we keep the data?”

The combination of these metrics gives us latency at a given throughput. A high-performing system should have latency that is low and stable, while keeping up with the throughput of changes created by the source system. 

Simulation

A Fivetran connection graphed over time looks something like this: 

To simulate this, we designed a benchmark that runs a data pipeline loading data from common OLTP relational databases such as Oracle, PostgreSQL, SQL Server, and MySQL. The first step was to pick a workload simulator; we selected TPROC-C [1] — a variant of TPC-C as implemented by HammerDB — because it is widely supported on the database systems we want to benchmark and showcases our ability to ingest data from a heavily loaded relational database.

TPROC-C generates significantly less data than is typical for our customer base, so we augmented the workload with additional detail columns to bring the average row width up to about 1 KB. This is roughly the same approach used by GigaOm in their benchmark of Fivetran and Qlik. Note that the change log format used by the RDBMS is significantly less compact than the “logical” size of the row.
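To give a sense of that augmentation step, here is a minimal sketch of widening the schema with filler detail columns. The base row width, column count, and names are assumptions for illustration, not the exact schema we used.

```python
import random
import string

TARGET_ROW_BYTES = 1024  # desired average "logical" row width (~1 KB)
BASE_ROW_BYTES = 200     # assumed average width of a stock TPROC-C row
NUM_DETAIL_COLS = 4      # hypothetical number of added detail columns

def detail_column_ddl(table: str) -> list[str]:
    """DDL statements that widen a table with fixed-width filler columns."""
    pad = (TARGET_ROW_BYTES - BASE_ROW_BYTES) // NUM_DETAIL_COLS
    return [
        f"ALTER TABLE {table} ADD detail_{i} VARCHAR({pad})"
        for i in range(NUM_DETAIL_COLS)
    ]

def random_detail(width: int) -> str:
    """Random payload so each change record carries a realistic row size."""
    return "".join(random.choices(string.ascii_letters, k=width))
```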

Data set

Generator: TPROC-C
Volume: 240 GB/hour — measured as log size on disk
Transaction rate: 60M transactions per hour (16.6k transactions per second)
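To put those numbers in perspective: 240 GB/hour of change log works out to roughly 240,000 MB / 3,600 s ≈ 67 MB/s that the pipeline must continuously read and process, and 60M transactions per hour is 60,000,000 / 3,600 ≈ 16,667 per second, the ~16.6k TPS figure above.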

Source system (Oracle, single node)

Instance: GCP E2_HIGHMEM_16
vCPUs: 16
Memory: 128 GB

Fivetran system

Memory limit (JVM): 8 GB

Destination system

Data warehouse size: Snowflake XS

Results

Running this benchmark for about two hours gives us a series of sync cycles. The cycle frequency and batch size are primarily self-governing, based on how long the previous cycle takes to complete.
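That self-governing behavior can be sketched in a few lines; the minimum period and the shape of sync_once are hypothetical, not Fivetran's actual scheduler.

```python
import time

MIN_SYNC_PERIOD_S = 60  # assumed floor on the cycle period; illustrative only

def run_sync_cycles(sync_once):
    """Start the next cycle as soon as the previous one finishes.

    A slow cycle means more changes accumulate, so the next batch is
    larger and the effective period stretches; fast cycles shrink the
    period toward the configured floor. The schedule self-governs.
    """
    while True:
        started = time.monotonic()
        sync_once()  # read newly accumulated changes, merge one micro-batch
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, MIN_SYNC_PERIOD_S - elapsed))
```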

Each bar in this graph shows the start and end of a sync cycle. The TPROC-C workload generates about 240 GB/hour of change data and runs until it is turned off about two hours after the test begins. Given the periodic nature of a micro-batch workload and the fact that our workload generator is constantly making transactional changes to the source system, we can calculate the longest possible delay in data delivery from the gap between consecutive syncs. That latency measurement looks like this:
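Concretely: a change committed just after sync N begins reading is only captured by sync N+1, so the worst-case delay for that pair is the end of cycle N+1 minus the start of cycle N. A minimal sketch of that calculation, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass
class SyncCycle:
    start: float  # epoch seconds when the cycle began reading changes
    end: float    # epoch seconds when the batch landed in the warehouse

def worst_case_latencies(cycles: list[SyncCycle]) -> list[float]:
    """Upper bound on delivery delay for each pair of consecutive cycles.

    A row committed just after cycles[i] starts is picked up by
    cycles[i + 1] and becomes visible at cycles[i + 1].end at the latest.
    """
    return [
        cycles[i + 1].end - cycles[i].start
        for i in range(len(cycles) - 1)
    ]
```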

From this raw latency data, we can compute industry-standard latency percentiles across each run of our test. In most cases, even on this high-throughput workload, we deliver fresh data into Snowflake in less than 17 minutes.
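Given those per-cycle worst-case latencies, the percentiles fall out of the standard library; the p50/p95/p99 cut points shown here are a common choice we assume for illustration, not necessarily the exact set reported above.

```python
import statistics

def latency_percentiles(latencies_s: list[float]) -> dict[str, float]:
    """p50/p95/p99 of worst-case latency, converted to minutes."""
    qs = statistics.quantiles(latencies_s, n=100)  # 99 percentile cut points
    return {
        "p50_min": qs[49] / 60,   # 50th percentile
        "p95_min": qs[94] / 60,   # 95th percentile
        "p99_min": qs[98] / 60,   # 99th percentile
    }
```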

We expect latency to be sensitive to workload size. Here, we configured a fairly large single-node database workload (~16k transactions per second). In our benchmark we want to exercise the system heavily, but a less intense workload would typically support lower latency than graphed above. We will follow up on that thread in a future post.

Conclusions

We’ve built an automated test suite for measuring realistic incremental sync performance in our system and are looking forward to three things: 

  1. We plan to use this system internally to make sure that data pipeline performance is consistently fast. These tests run every day (including some on pre-production software), so we will quickly become aware of performance regressions and be able to address them before they impact our customers. 
  2. We have extended this benchmark framework to more system configurations beyond Oracle and Snowflake and will be sharing those results shortly. 
  3. We’re excited to share this data publicly and intend to make the results of our ongoing performance measurement publicly available in a live-updating format — similar to https://status.fivetran.com/ 

Overall, this benchmark shows Fivetran’s ability to keep data current in near real-time, even under sustained high throughput. Thanks for reading, and stay tuned for more performance data! 

Experience Fivetran data pipelines for yourself with a free trial.