Fivetran has previously published benchmark data comparing the performance of various data warehouse platforms, but we’ve never published benchmarks measuring the performance of our own pipelines. We’re proud to introduce a new system for benchmarking and regularly testing the throughput of our data pipelines. These benchmarks will enable users to measure and optimize their Fivetran usage, and to directly observe the results of our continuing efforts to improve pipeline sync performance.
Replication from operational and transactional databases to data warehouses and data lakes is the most common Fivetran use case — over 4000 customers replicate change data from databases each day. Database workloads are also the most intensive tasks at Fivetran. We built this benchmark to showcase our performance loading from common OLTP relational databases like Oracle, PostgreSQL, SQL Server, and MySQL. TPROC-C — a variant of TPC-C as implemented by HammerDB — is widely supported on the database systems we wanted to benchmark and highlights our ability to ingest data from heavily loaded relational databases. Therefore, we chose TPROC-C as our benchmark standard.
This series will be broken into two installments. First, we’ll evaluate the throughput of a bulk historical load. Second, we’ll use the TPROC-C workload to evaluate the day-to-day latency of a high-volume database-to-data-warehouse connection.
Spoiler alert: We have demonstrated import performance of over 500 GB/hour on large databases.
But before we jump into specifics, let’s first talk about how the test was set up.
Historical data import
In principle, benchmarks of historical loads involve the import of a static dataset from a source system into a target system: the pipeline performs a variant of select *, a format conversion, and a copy into the destination.
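As a rough illustration, that naive approach looks something like the sketch below. The source connection follows the standard Python DB-API; bulk_copy is a hypothetical destination-side loader, not a real Fivetran or warehouse API.

```python
import csv
import io

# Naive historical load: read the whole table, convert it, copy it over.
# `source_conn` is a standard DB-API connection; `dest.bulk_copy` is a
# hypothetical destination-side bulk loader, used for illustration only.
def naive_historical_load(source_conn, dest, table):
    cursor = source_conn.cursor()
    cursor.execute(f"SELECT * FROM {table}")   # one giant read
    rows = cursor.fetchall()                   # entire table in memory

    buffer = io.StringIO()                     # format conversion (CSV here)
    csv.writer(buffer).writerows(rows)

    dest.bulk_copy(table, buffer.getvalue())   # copy into the destination
```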
In reality, a production sync is more complex than this, for two main reasons:
- Network failures and other interruptions may prevent the completion of the historical sync. Fivetran addresses this with a “checkpoint” that tracks the most recently imported data point to ensure the system can pick back up where it left off after an interruption.
- Many historical loads are so large that a single long-running read transaction would force the source database to hold onto transactional data for the entire import. To solve this, Fivetran breaks large tables into consumable chunks, keeping each transaction short during a historical sync, as shown in the sketch after this list.
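Here is a minimal sketch of both ideas together, assuming the table has a monotonically increasing key column (here, id in column 0) and reusing the hypothetical bulk_copy loader from above. checkpoint_store stands in for any durable key-value store, and the SQL dialect is simplified for illustration.

```python
CHUNK_SIZE = 100_000  # rows per chunk; an illustrative value

def chunked_historical_load(source_conn, dest, table, checkpoint_store):
    # Resume from the last durable checkpoint, or start from the beginning.
    last_key = checkpoint_store.get(table, 0)

    while True:
        cursor = source_conn.cursor()
        # One short-lived query per chunk keeps source transactions brief.
        cursor.execute(
            f"SELECT * FROM {table} WHERE id > %s ORDER BY id LIMIT %s",
            (last_key, CHUNK_SIZE),
        )
        rows = cursor.fetchall()
        if not rows:
            break                              # historical sync complete

        dest.bulk_copy(table, rows)            # load this chunk
        last_key = rows[-1][0]                 # highest id in this chunk
        checkpoint_store[table] = last_key     # durable progress marker
```

If the sync is interrupted at any point, rerunning chunked_historical_load picks up from the last recorded key rather than starting over.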
For our historical benchmark, we use the initial dataset generated by HammerDB. We configure the dataset to be about 1 TB on disk and then perform a Fivetran sync. The throughput we measure is 1 TB divided by the time between the sync starting and the sync ending:

Throughput = 1 TB / (sync end - sync start)
To make TPROC-C replicate typical customer data volumes (based on observed averages from our service), we augmented the workload with additional detail columns to bring the average row width up to about 1 KB. This is roughly the same approach used by Gigaom in their benchmark of Fivetran.
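For illustration, the augmentation might look something like the following sketch. The column names, count, and widths are hypothetical, and the SQL dialect is simplified; the actual change was made to the HammerDB-generated schema.

```python
import random
import string

# Hypothetical filler columns used to widen rows toward ~1 KB.
EXTRA_COLUMNS = {"detail_1": 255, "detail_2": 255, "detail_3": 255}

def widen_rows(conn, table):
    cursor = conn.cursor()
    for name, width in EXTRA_COLUMNS.items():
        # Add the column, then fill it with random text of the given width.
        cursor.execute(f"ALTER TABLE {table} ADD COLUMN {name} VARCHAR({width})")
        filler = "".join(random.choices(string.ascii_letters, k=width))
        cursor.execute(f"UPDATE {table} SET {name} = %s", (filler,))
    conn.commit()
```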
Benchmark configuration details
- Data set
- Source system (Oracle)
- Destination system (Snowflake)
Benchmark results
When running the benchmark with this dataset, we can fully sync the source tables into Snowflake in 1 hour 30 minutes to 1 hour 40 minutes, yielding the following results:

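Plugging those numbers into the throughput formula above, 1 TB synced in 1 hour 30 minutes works out to roughly 1,000 GB / 1.5 h ≈ 667 GB/hour, and 1 hour 40 minutes to roughly 1,000 GB / 1.67 h ≈ 600 GB/hour, both comfortably above the 500 GB/hour figure quoted earlier.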
Fivetran has spent significant time over the past year improving performance and ensuring that we can replicate large data volumes quickly. Improvements include expanded parallelism in our core system and an optimized data load pattern into common data warehouses.
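As a generic illustration of what expanded parallelism can mean for a load like this (not Fivetran's actual architecture), independent table chunks can be extracted and loaded concurrently. Here, connect is a hypothetical factory that opens a fresh source connection per chunk, and bulk_copy is the same hypothetical loader used in the earlier sketches.

```python
from concurrent.futures import ThreadPoolExecutor

def load_chunk(connect, dest, table, key_range):
    lo, hi = key_range
    conn = connect()                           # one connection per chunk
    cursor = conn.cursor()
    cursor.execute(
        f"SELECT * FROM {table} WHERE id > %s AND id <= %s", (lo, hi)
    )
    dest.bulk_copy(table, cursor.fetchall())   # load this chunk

def parallel_load(connect, dest, table, key_ranges, workers=8):
    # Chunks are independent units of work, so throughput scales with the
    # worker count until the source or destination becomes the bottleneck.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(load_chunk, connect, dest, table, r)
            for r in key_ranges
        ]
        for f in futures:
            f.result()                         # surface per-chunk errors
```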
The biggest implication of this performance increase is faster time to value. Each newly added data source is ready to query for analytics, AI/ML operations, and reporting within minutes. Fivetran supports replication from 30+ databases into a variety of target systems. Through continuous development of new connectors, faster throughput, and lower latency, we ensure we can meet our customers' needs, giving them access to their data when and where they need it.
Stay tuned for a follow-up report detailing the improvements we’ve made to change data capture.
[CTA_MODULE]