How effectively data pipelines move data depends on how they’re engineered. Less-efficient pipelines will increase your data stack costs and require your data engineers to do more work. That means it’s worth understanding pipeline efficiency if you want to save your data team time and control infrastructure costs.
Specifically, look for a data pipeline that:
- Normalizes data and provides well-designed schemas
- Functions idempotently, automatically recovering from sync failures
- Syncs data incrementally
- Allows you to granularly de-select columns and tables
- Enables programmatic management
NB: In this post, we’re focusing on data pipelines in operation as they extract data from a source and land it in a destination. Two additional factors influence overall pipeline efficiency: pipeline maintenance and data transformation capabilities. Consider choosing a data movement provider that automatically adapts to source changes and offers features that accelerate data transformation.
1. Data normalization
Semi-structured data from a SaaS source like Marketo isn’t immediately usable for SQL-based data analysis. So what’s the best way to normalize it into a standard format and create a useful schema for data engineers and analysts?
A well-designed normalized schema provides nearly everything someone needs to know to work with a data set:
- The tables represent business objects
- The columns represent attributes of those objects
- Well-named tables and columns are self-describing
- The primary and foreign key constraints reveal the relationships between the business objects
Here’s a representative self-describing Fivetran schema for Marketo:
Fivetran schemas are as close to third normal form (3NF) as possible. For SaaS API connectors, we’ve designed the schemas to represent the underlying relational data model in that source, which means data teams can automate extract-load into a normalized schema, freeing up a significant amount of data engineering time for higher-value activities.
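To make the idea concrete, here is a minimal sketch of what a normalized, self-describing schema looks like in practice. The table and column names below are illustrative Marketo-like business objects, not Fivetran's actual schema:

```python
import sqlite3

# A minimal sketch of a normalized (3NF-style) schema, using two
# hypothetical business objects: "program" and "campaign".
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE program (
        id         INTEGER PRIMARY KEY,   -- one table per business object
        name       TEXT NOT NULL,
        created_at TEXT NOT NULL          -- columns are object attributes
    )
""")
conn.execute("""
    CREATE TABLE campaign (
        id         INTEGER PRIMARY KEY,
        -- the foreign key constraint reveals the relationship
        program_id INTEGER NOT NULL REFERENCES program(id),
        name       TEXT NOT NULL
    )
""")

conn.execute("INSERT INTO program VALUES (1, 'Q3 Webinar Series', '2024-07-01')")
conn.execute("INSERT INTO campaign VALUES (10, 1, 'Webinar invite email')")

# Well-named tables and columns make the query self-describing.
row = conn.execute("""
    SELECT p.name, c.name
    FROM campaign c JOIN program p ON p.id = c.program_id
""").fetchone()
print(row)  # ('Q3 Webinar Series', 'Webinar invite email')
```

An analyst reading this schema can infer the data model directly from the table names, column names and key constraints, without consulting the source API's documentation.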
Fivetran-enabled workflow with automated access to data and no waiting for data engineering time
Normalized schemas also create confidence that data is accurate, reducing back and forth between teams and redundant pipeline management.
Impact of normalization on warehouse costs
If you’re using warehouse-native data integration tools like AWS Glue or Azure Data Factory, you’ll have to normalize the landed data yourself, a process that consumes warehouse compute and increases costs.
If you’re using a third-party data movement tool that offers normalization, it’s important to know whether the provider performs the normalization within its own systems or within your data warehouse or destination. If it’s the latter, it can get expensive.
At Fivetran, we normalize your data within our own VPC, so you’ll never have to worry about data ingestion processes devouring your warehousing compute bill. Here’s what it looked like when a current customer simultaneously used Fivetran and two other data integration providers to replicate the same data into the same warehouse; the other providers used the customer’s warehouse instance to normalize the data. (Figures are for monthly usage.)
2. Idempotence
In the context of data movement, “idempotence” refers to the ability of a data pipeline to prevent the creation of duplicate data when syncs fail.
For the sake of efficiency, data warehouses load records in batches, so sync progress is recorded by the batch. When a sync is interrupted, it’s often impossible to pinpoint the individual record that was being processed at the time of failure; when the sync resumes, the pipeline starts at the beginning of the most recent batch, and some data is processed again.
With an idempotent process, by contrast, every unique record is properly identified and no duplication occurs. Fivetran engineers idempotence into every data pipeline, so failures don’t produce duplicate data.
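One common way to achieve this is to key every record on a stable primary key and upsert instead of blindly inserting, so replaying a batch after a failure leaves the destination unchanged. Here's a minimal sketch of that pattern (table and column names are illustrative, not Fivetran's internals):

```python
import sqlite3

# Idempotent loading sketch: each record carries a stable primary key,
# so replaying a batch after an interrupted sync upserts rather than
# duplicating rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lead (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")

def load_batch(rows):
    # ON CONFLICT makes the write idempotent: re-running the same batch
    # leaves the table in exactly the same state.
    conn.executemany(
        """INSERT INTO lead (id, email, updated_at) VALUES (?, ?, ?)
           ON CONFLICT(id) DO UPDATE SET email = excluded.email,
                                         updated_at = excluded.updated_at""",
        rows,
    )

batch = [(1, "a@example.com", "2024-01-01"), (2, "b@example.com", "2024-01-01")]
load_batch(batch)
load_batch(batch)  # simulate a retry after a failed sync

count = conn.execute("SELECT COUNT(*) FROM lead").fetchone()[0]
print(count)  # 2 -- no duplicates despite the replay
```

Because the second `load_batch` call is a no-op on already-loaded rows, the pipeline can safely restart from the beginning of the most recent batch without producing duplicates.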
3. Incremental syncs
Another key feature of an efficient data pipeline is the ability to update incrementally. While full syncs are necessary to capture all your data initially, they’re inappropriate for routine updates: they often take a long time, can bog down both the data source and the destination, and consume excessive network bandwidth.
The difference in data volume between a full and incremental sync is usually one or two orders of magnitude, with resulting increases in data warehouse costs.
Make sure your data movement provider syncs data incrementally — and does so in a way that solves the common problems associated with incremental syncs. Otherwise, the inefficiencies caused by those problems will be passed along to your data team.
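Incremental syncs typically work by tracking a cursor, such as a last-modified timestamp, and fetching only records that changed since the previous run. A minimal sketch, assuming the source exposes an `updated_at` field (all names here are illustrative, not a real connector API):

```python
# Hypothetical source records with a change timestamp the pipeline
# can use as a cursor.
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-03"},
    {"id": 3, "updated_at": "2024-01-05"},
]

def incremental_sync(cursor):
    """Return only records changed since the saved cursor, plus a new cursor."""
    changed = [r for r in source if r["updated_at"] > cursor]
    new_cursor = max((r["updated_at"] for r in changed), default=cursor)
    return changed, new_cursor

# The initial (full) sync captures everything; routine syncs move only the delta.
rows, cursor = incremental_sync("")      # full sync: all 3 rows
rows, cursor = incremental_sync(cursor)  # routine sync: nothing changed, 0 rows
print(len(rows), cursor)
```

When nothing has changed since the last run, the routine sync moves zero rows, which is where the one-to-two-orders-of-magnitude savings over repeated full syncs comes from.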
4. Data selection
Your team likely isn’t interested in analyzing the data in every single table an ELT provider syncs by default into your destination — and if those tables are large, they will significantly increase your data integration and warehousing costs. We commonly see customers looking to avoid syncing logging tables, notification tables and external data feeds that have little or no historical or analytical value to the end user.
Efficient ELT platforms will allow you to granularly de-select tables and columns that are unnecessary for analysis. Fivetran makes it easy for your team to do this:
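Conceptually, de-selection means pruning the sync plan before extraction ever happens. Here's a minimal sketch of that idea; the table and column names are hypothetical, and in practice you would manage this through the provider's UI or API rather than hand-rolled code:

```python
# A hypothetical sync plan: every table and column the connector
# would replicate by default.
sync_plan = {
    "lead": ["id", "email", "score", "internal_notes"],
    "campaign": ["id", "name"],
    "activity_log": ["id", "payload"],  # high-volume logging table, no analytical value
}

# Granular de-selection: drop whole tables and individual columns.
deselect_tables = {"activity_log"}
deselect_columns = {"lead": {"internal_notes"}}

effective_plan = {
    table: [c for c in cols if c not in deselect_columns.get(table, set())]
    for table, cols in sync_plan.items()
    if table not in deselect_tables
}
print(effective_plan)
# {'lead': ['id', 'email', 'score'], 'campaign': ['id', 'name']}
```

Pruning happens before extraction, so the de-selected data is never moved, never stored and never billed.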
5. Programmatic pipeline management
The ability to manage data pipelines via an ELT provider’s API, not just its user interface (UI), can significantly reduce your team’s workload — and is essential for anyone trying to build data solutions at scale. The Fivetran API offers an efficient way to perform bulk actions, communicate with Fivetran from a different application, and automate human-led processes using codified logic. Data teams that can programmatically design workflows will have far more time for business-critical tasks and advanced data analysis.
For example, we’ve helped customers use the Fivetran API to create connectors to hundreds of databases — and in one instance over 1,000 — saving them many hours of work in the process.
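As a sketch of what a bulk action looks like in code, the snippet below builds (but doesn't send) one authenticated request per connector. The endpoint path, payload and connector IDs are assumptions for illustration; consult the Fivetran API documentation for the actual routes and authentication details:

```python
import base64
import json
import urllib.request

# Hypothetical base URL and route -- check the provider's API docs.
API_BASE = "https://api.fivetran.com/v1"

def build_pause_request(connector_id, api_key, api_secret):
    """Build (but don't send) a PATCH request that pauses one connector."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return urllib.request.Request(
        url=f"{API_BASE}/connectors/{connector_id}",  # assumed route
        data=json.dumps({"paused": True}).encode(),
        method="PATCH",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )

# Bulk action: codify the logic once, then apply it to many connectors.
requests_to_send = [
    build_pause_request(cid, "my_key", "my_secret")
    for cid in ["connector_a", "connector_b", "connector_c"]
]
print(len(requests_to_send), requests_to_send[0].get_method())  # 3 PATCH
```

The same pattern scales from three connectors to a thousand: the loop body doesn't change, only the list of IDs, which is exactly the kind of repetitive work that's painful to do through a UI.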
Here’s an illustration of just one of the API use cases we support — data consultancy Untitled Firm using the Fivetran API to connect to its customers’ data and create powerful analytics applications:
Choose the most efficient pipeline for your team and business
Selecting an ELT tool or platform that provides efficiency in all the above areas will have a powerful cumulative effect on the performance and efficiency of your data team and stack. It will also have a nontrivial quality-of-life impact — data teams are just happier when they can focus on interesting, high-value work and forget about manual engineering chores or suddenly soaring warehouse costs.
Interested in learning more about how to optimize your data stack and make your team more efficient? Check out our recent ebook, How to choose the most cost-effective data pipeline for your business.