How effectively data pipelines move data depends on how they’re engineered. Less-efficient pipelines will increase your data stack costs and require your data engineers to do more work. That means it’s worth understanding pipeline efficiency if you want to save your data team time and control infrastructure costs.
Specifically, look for a data pipeline that:
- Normalizes data and provides well-designed schemas
- Functions idempotently, automatically recovering from sync failures
- Syncs data incrementally
- Allows you to granularly de-select columns and tables
- Enables programmatic management
NB: In this post, we’re focusing on data pipelines in operation as they extract data from a source and land it in a destination. Two additional factors influence overall pipeline efficiency: pipeline maintenance and data transformation capabilities. Consider choosing a data movement provider that automatically adapts to source changes and offers features that accelerate data transformation.
1. Data normalization
Semi-structured data from a SaaS source like Marketo isn’t immediately usable for SQL-based data analysis. So what’s the best way to normalize it into a standard format and create a useful schema for data engineers and analysts?
A well-designed normalized schema provides nearly everything someone needs to know to work with a data set:
- The tables represent business objects
- The columns represent attributes of those objects
- Well-named tables and columns are self-describing
- The primary and foreign key constraints reveal the relationships between the business objects
Here’s a representative self-describing Fivetran schema for Marketo:
Fivetran schemas are as close to third normal form (3NF) as possible. For SaaS API connectors, we’ve designed the schemas to represent the underlying relational data model in that source, which means data teams can automate extract-load into a normalized schema, freeing up a significant amount of data engineering time for higher-value activities.
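To make the idea concrete, here is a minimal sketch of what a normalized, self-describing schema looks like in practice. The table and column names below are illustrative Marketo-like business objects, not Fivetran's actual schema:

```python
import sqlite3

# A minimal sketch of a normalized (3NF-style) schema, using two
# hypothetical business objects: "program" and "campaign".
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

conn.execute("""
    CREATE TABLE program (
        id         INTEGER PRIMARY KEY,   -- one table per business object
        name       TEXT NOT NULL,
        created_at TEXT NOT NULL          -- columns are object attributes
    )
""")
conn.execute("""
    CREATE TABLE campaign (
        id         INTEGER PRIMARY KEY,
        -- the foreign key constraint reveals the relationship
        program_id INTEGER NOT NULL REFERENCES program(id),
        name       TEXT NOT NULL
    )
""")

conn.execute("INSERT INTO program VALUES (1, 'Q3 Webinar Series', '2024-07-01')")
conn.execute("INSERT INTO campaign VALUES (10, 1, 'Webinar invite email')")

# Well-named tables and columns make the query self-describing.
row = conn.execute("""
    SELECT p.name, c.name
    FROM campaign c JOIN program p ON p.id = c.program_id
""").fetchone()
print(row)  # ('Q3 Webinar Series', 'Webinar invite email')
```

An analyst reading this schema can infer the data model directly from the table names, column names and key constraints, without consulting the source API's documentation.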
Fivetran-enabled workflow with automated access to data and no waiting for data engineering time
Normalized schemas also create confidence that data is accurate, reducing back and forth between teams and redundant pipeline management.
Impact of normalization on warehouse costs
If you’re using warehouse-native data integration tools like AWS Glue or Azure Data Factory, you’ll have to normalize the landed data yourself, a process that consumes warehouse compute and increases costs.
If you’re using a third-party data movement tool that offers normalization, it’s important to know whether the provider performs the normalization within its own systems or within your data warehouse or destination. If it’s the latter, it can get expensive.
At Fivetran, we normalize your data within our own VPC, so you’ll never have to worry about data ingestion processes devouring your warehousing compute bill. Here’s what it looked like when a current customer simultaneously used Fivetran and two other data integration providers to replicate the same data into the same warehouse; the other providers used the customer’s warehouse instance to normalize the data. (Figures are for monthly usage.)
2. Idempotence
In the context of data movement, “idempotence” refers to the ability of a data pipeline to prevent the creation of duplicate data when syncs fail.
For the sake of efficiency, data warehouses load records in batches, so sync progress is recorded by the batch. When a sync is interrupted, it’s often impossible to pinpoint the individual record that was being processed at the time of failure; when the sync resumes, the pipeline starts at the beginning of the most recent batch, and some data is processed again.
With an idempotent process, by contrast, every unique record is properly identified and no duplication occurs. Fivetran engineers idempotence into every data pipeline, so failures don’t produce duplicate data.
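One common way to achieve this is to key every record on a stable primary key and upsert instead of blindly inserting, so replaying a batch after a failure leaves the destination unchanged. Here's a minimal sketch of that pattern (table and column names are illustrative, not Fivetran's internals):

```python
import sqlite3

# Idempotent loading sketch: each record carries a stable primary key,
# so replaying a batch after an interrupted sync upserts rather than
# duplicating rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lead (id INTEGER PRIMARY KEY, email TEXT, updated_at TEXT)")

def load_batch(rows):
    # ON CONFLICT makes the write idempotent: re-running the same batch
    # leaves the table in exactly the same state.
    conn.executemany(
        """INSERT INTO lead (id, email, updated_at) VALUES (?, ?, ?)
           ON CONFLICT(id) DO UPDATE SET email = excluded.email,
                                         updated_at = excluded.updated_at""",
        rows,
    )

batch = [(1, "a@example.com", "2024-01-01"), (2, "b@example.com", "2024-01-01")]
load_batch(batch)
load_batch(batch)  # simulate a retry after a failed sync

count = conn.execute("SELECT COUNT(*) FROM lead").fetchone()[0]
print(count)  # 2 -- no duplicates despite the replay
```

Because the second `load_batch` call is a no-op on already-loaded rows, the pipeline can safely restart from the beginning of the most recent batch without producing duplicates.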
3. Incremental syncs
Another key feature of an efficient data pipeline is the ability to update incrementally. While full syncs are necessary to capture all your data initially, they’re inappropriate for routine updates: they often take a long time, can bog down both the data source and the destination, and consume excessive network bandwidth.
The difference in data volume between a full and incremental sync is usually one or two orders of magnitude, with resulting increases in data warehouse costs.
Make sure your data movement provider syncs data incrementally — and does so in a way that solves the common problems associated with incremental syncs. Otherwise, the inefficiencies caused by those problems will be passed along to your data team.
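Incremental syncs typically work by tracking a cursor, such as a last-modified timestamp, and fetching only records that changed since the previous run. A minimal sketch, assuming the source exposes an `updated_at` field (all names here are illustrative, not a real connector API):

```python
# Hypothetical source records with a change timestamp the pipeline
# can use as a cursor.
source = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-01-03"},
    {"id": 3, "updated_at": "2024-01-05"},
]

def incremental_sync(cursor):
    """Return only records changed since the saved cursor, plus a new cursor."""
    changed = [r for r in source if r["updated_at"] > cursor]
    new_cursor = max((r["updated_at"] for r in changed), default=cursor)
    return changed, new_cursor

# The initial (full) sync captures everything; routine syncs move only the delta.
rows, cursor = incremental_sync("")      # full sync: all 3 rows
rows, cursor = incremental_sync(cursor)  # routine sync: nothing changed, 0 rows
print(len(rows), cursor)
```

When nothing has changed since the last run, the routine sync moves zero rows, which is where the one-to-two-orders-of-magnitude savings over repeated full syncs comes from.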
4. Data selection
Your team likely isn’t interested in analyzing the data in every single table an ELT provider syncs by default into your destination — and if those tables are large, they will significantly increase your data integration and warehousing costs. We commonly see customers looking to avoid syncing logging tables, notification tables and external data feeds that have little or no historical or analytical value to the end user.
Efficient ELT platforms will allow you to granularly de-select tables and columns that are unnecessary for analysis. Fivetran makes it easy for your team to do this:
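Conceptually, de-selection means pruning the sync plan before extraction ever happens. Here's a minimal sketch of that idea; the table and column names are hypothetical, and in practice you would manage this through the provider's UI or API rather than hand-rolled code:

```python
# A hypothetical sync plan: every table and column the connector
# would replicate by default.
sync_plan = {
    "lead": ["id", "email", "score", "internal_notes"],
    "campaign": ["id", "name"],
    "activity_log": ["id", "payload"],  # high-volume logging table, no analytical value
}

# Granular de-selection: drop whole tables and individual columns.
deselect_tables = {"activity_log"}
deselect_columns = {"lead": {"internal_notes"}}

effective_plan = {
    table: [c for c in cols if c not in deselect_columns.get(table, set())]
    for table, cols in sync_plan.items()
    if table not in deselect_tables
}
print(effective_plan)
# {'lead': ['id', 'email', 'score'], 'campaign': ['id', 'name']}
```

Pruning happens before extraction, so the de-selected data is never moved, never stored and never billed.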
5. Programmatic pipeline management
The ability to manage data pipelines via an ELT provider’s API, not just its user interface (UI), can significantly reduce your team’s workload — and is essential for anyone trying to build data solutions at scale. The Fivetran API offers an efficient way to perform bulk actions, communicate with Fivetran from a different application, and automate human-led processes using codified logic. Data teams that can programmatically design workflows will have far more time for business-critical tasks and advanced data analysis.
For example, we’ve helped customers use the Fivetran API to create connectors to hundreds of databases — and in one instance over 1,000 — saving them many hours of work in the process.
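As a sketch of what a bulk action looks like in code, the snippet below builds (but doesn't send) one authenticated request per connector. The endpoint path, payload and connector IDs are assumptions for illustration; consult the Fivetran API documentation for the actual routes and authentication details:

```python
import base64
import json
import urllib.request

# Hypothetical base URL and route -- check the provider's API docs.
API_BASE = "https://api.fivetran.com/v1"

def build_pause_request(connector_id, api_key, api_secret):
    """Build (but don't send) a PATCH request that pauses one connector."""
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    return urllib.request.Request(
        url=f"{API_BASE}/connectors/{connector_id}",  # assumed route
        data=json.dumps({"paused": True}).encode(),
        method="PATCH",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )

# Bulk action: codify the logic once, then apply it to many connectors.
requests_to_send = [
    build_pause_request(cid, "my_key", "my_secret")
    for cid in ["connector_a", "connector_b", "connector_c"]
]
print(len(requests_to_send), requests_to_send[0].get_method())  # 3 PATCH
```

The same pattern scales from three connectors to a thousand: the loop body doesn't change, only the list of IDs, which is exactly the kind of repetitive work that's painful to do through a UI.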
Here’s an illustration of just one of the API use cases we support — data consultancy Untitled Firm using the Fivetran API to connect to its customers’ data and create powerful analytics applications:
Choose the most efficient pipeline for your team and business
Selecting an ELT tool or platform that provides efficiency in all the above areas will have a powerful cumulative effect on the performance and efficiency of your data team and stack. It will also have a nontrivial quality-of-life impact — data teams are just happier when they can focus on interesting, high-value work and forget about manual engineering chores or suddenly soaring warehouse costs.
Interested in learning more about how to optimize your data stack and make your team more efficient? Check out our recent ebook, How to choose the most cost-effective data pipeline for your business.