Benchmarked: A data pipeline throughput analysis

Looking under the hood and load testing Fivetran’s data pipelines.
January 16, 2025

Fivetran has previously published benchmark data comparing the performance of various data warehouse platforms, but we’ve never published benchmarks measuring the performance of our own pipelines. We are proud to introduce a new system that benchmarks and regularly tests the throughput of our data pipelines. It will enable users to measure and optimize their Fivetran usage, as well as directly observe the results of our continuing efforts to improve pipeline sync performance.

Replication from operational and transactional databases to data warehouses and data lakes is the most common Fivetran use case — over 4,000 customers replicate change data from databases each day. Database workloads are also the most intensive tasks at Fivetran. We built this benchmark to showcase our performance loading from common OLTP relational databases like Oracle, PostgreSQL, SQL Server, and MySQL. We chose TPROC-C — a variant of TPC-C as implemented by HammerDB — as our benchmark standard because it is widely supported on the database systems we wanted to test and highlights our ability to ingest data from heavily loaded relational databases.

This series will be broken down into two installments. First, we’ll evaluate the throughput of a bulk historical load. Second, we’ll use the TPROC-C workload to evaluate the day-to-day latency of a high-volume database-to-data-warehouse connection.

Spoiler alert: We have demonstrated import performance of over 500 GB/hour on large databases. 

But before we jump into specifics, let’s first talk about how the test was set up. 

Historical data import

In principle, a historical-load benchmark imports a static dataset from a source system into a target system. This amounts to the system performing a variant of SELECT *, a format conversion, and a copy into the destination.
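To make that picture concrete, below is a minimal sketch of the select-convert-copy pattern in Python. The connection objects, table name, and staging path are illustrative assumptions (for example, a python-oracledb source connection and a Snowflake connector destination connection), not Fivetran's actual implementation.

```python
# Minimal sketch of a naive historical load: SELECT * from the source,
# convert the rows to CSV, and COPY the file into the destination.
# src_conn and dst_conn are assumed to be DB-API connections (for example,
# python-oracledb and snowflake-connector-python); names are illustrative.
import csv

def naive_historical_load(src_conn, dst_conn, table: str, stage_dir: str) -> None:
    path = f"{stage_dir}/{table}.csv"

    # 1. Full-table read plus format conversion to CSV.
    with src_conn.cursor() as cur, open(path, "w", newline="") as f:
        cur.execute(f"SELECT * FROM {table}")
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # header row
        for row in cur:
            writer.writerow(row)

    # 2. Stage the file and copy it into the destination table (Snowflake-style).
    with dst_conn.cursor() as cur:
        cur.execute(f"PUT file://{path} @%{table}")
        cur.execute(
            f"COPY INTO {table} FROM @%{table} "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
        )
```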

In reality, historical syncs are more complex than this for two main reasons:

  1. Network failures and other interruptions may prevent a historical sync from completing. Fivetran addresses this with a “checkpoint” that tracks the most recently imported data point, so the system can pick back up where it left off after an interruption.
  2. Many historical loads are so large that a single long-running read would prevent the database from releasing additional transactional data while the import is running. To solve this, Fivetran breaks large tables into consumable chunks, keeping each read transaction short during a historical sync. Both ideas are illustrated in the sketch after this list.
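Here is a minimal sketch of how those two ideas fit together: the table is read in bounded chunks, and a checkpoint is persisted after each chunk so an interrupted sync can resume. It assumes a monotonically increasing numeric key to paginate on; the checkpoint file, chunk size, and write_chunk callback are hypothetical, not Fivetran's implementation.

```python
# Chunked extraction with a persisted checkpoint (illustrative only).
import json
import os

CHECKPOINT_FILE = "checkpoint.json"
CHUNK_ROWS = 100_000

def load_checkpoint() -> int:
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["last_id"]
    return 0

def save_checkpoint(last_id: int) -> None:
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_id": last_id}, f)

def sync_table(src_conn, write_chunk, table: str, key: str = "id") -> None:
    """Read the table in short, bounded transactions and resume after interruptions."""
    last_id = load_checkpoint()
    while True:
        with src_conn.cursor() as cur:
            cur.execute(
                f"SELECT * FROM {table} WHERE {key} > :last_id "
                f"ORDER BY {key} FETCH FIRST {CHUNK_ROWS} ROWS ONLY",
                {"last_id": last_id},
            )
            rows = cur.fetchall()
        if not rows:
            break                 # table fully imported
        write_chunk(rows)         # convert and load the chunk into the destination
        last_id = rows[-1][0]     # assumes the key is the first selected column
        save_checkpoint(last_id)  # resume point if the sync is interrupted
```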

For our historical benchmark, we use the initial dataset generated by HammerDB. We configure the dataset to be about 1 TB on disk and then perform a Fivetran sync. The throughput we measure is 1 TB divided by the time between the sync starting and the sync ending:

Throughput = 1 TB / (sync end - sync start)
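As a quick worked example, here is the same calculation in Python with illustrative timestamps (a 1:35:00 sync of the 971 GB dataset described below):

```python
# Throughput = data size / (sync end - sync start), expressed in GB/hour.
from datetime import datetime

def throughput_gb_per_hour(size_gb: float, sync_start: datetime, sync_end: datetime) -> float:
    hours = (sync_end - sync_start).total_seconds() / 3600
    return size_gb / hours

# Illustrative timestamps: a sync that takes 1 hour 35 minutes.
start = datetime(2025, 1, 16, 10, 0, 0)
end = datetime(2025, 1, 16, 11, 35, 0)
print(round(throughput_gb_per_hour(971, start, end)))  # about 613 GB/hour
```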

To make TPROC-C reflect typical customer data volumes (based on observed averages from our service), we augmented the workload with additional detail columns, bringing the average row width up to about 1 KB. This is roughly the same approach GigaOm used in their benchmark of Fivetran.
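As an illustration of that kind of augmentation, the sketch below generates Oracle DDL that adds filler detail columns to a table. The column names, count, and widths are hypothetical, chosen only to push the average row width toward 1 KB; they are not the exact columns we added.

```python
# Hypothetical widening of a TPROC-C table with extra "detail" columns.
import random
import string

EXTRA_COLS = 8
COL_WIDTH = 96   # 8 columns x 96 characters of filler per row

def widen_table_ddl(table: str) -> list[str]:
    """Generate ALTER TABLE statements that add filler detail columns."""
    return [
        f"ALTER TABLE {table} ADD detail_{i} VARCHAR2({COL_WIDTH})"
        for i in range(EXTRA_COLS)
    ]

def filler_value() -> str:
    """Random text used to populate the new columns (e.g. via UPDATE statements)."""
    return "".join(random.choices(string.ascii_letters, k=COL_WIDTH))

# Example: print the DDL for ORDER_LINE, one of the largest TPROC-C tables.
for stmt in widen_table_ddl("ORDER_LINE"):
    print(stmt)
```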

Benchmark configuration details

Data set

Generator: TPROC-C
Size on disk: 971 GB (~1 TB)
Row count: 631,396,497
Average row width: 1,107 bytes
Table count: 140

Source system (Oracle)

Instance: GCP E2_HIGHMEM_16
vCPUs: 16 cores
Memory: 128 GB

Destination system (Snowflake)

Data warehouse size: Snowflake XS

Benchmark results

Running the benchmark against this dataset, we fully sync the source tables into Snowflake in roughly 1:30:00 to 1:40:00 (hh:mm:ss). For the 971 GB dataset above, that works out to an effective throughput of roughly 580–650 GB/hour, in line with the 500+ GB/hour figure noted earlier.

Fivetran has spent significant time over the past year improving performance and ensuring that we can replicate large data volumes quickly. The improvements include expanding parallelism in our core system and optimizing our data load pattern into common data warehouses. One example of what that parallelism can look like is sketched below.
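As a rough illustration of table-level parallelism, the sketch below extracts several source tables concurrently with a thread pool. The worker count and the per-table sync_table helper (for example, the chunked sync sketched earlier) are assumptions for illustration, not a description of Fivetran's internals.

```python
# Illustrative table-level parallelism: sync several tables concurrently.
from concurrent.futures import ThreadPoolExecutor, as_completed

def sync_all_tables(tables, sync_table, max_workers: int = 8) -> None:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(sync_table, t): t for t in tables}
        for future in as_completed(futures):
            table = futures[future]
            future.result()            # re-raise any error from the worker
            print(f"finished {table}")
```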

The biggest implication of this performance increase is faster time to value. Each additional data source is ready to query for analytics, AI/ML operations, and reporting within minutes of being added. Fivetran supports replication from 30+ databases into a variety of target systems. With continued investment in new connectors, faster throughput, and lower latency, we can meet the needs of our customers, giving them access to their data when and where they need it.

Stay tuned for a follow-up report detailing the improvements we’ve made to change data capture. 
