The principal metric for data engineering success is simple: Is data being used to make better organizational decisions? The infrastructure that enables such success, however, isn’t simple at all. Data engineers need to continuously centralize and maintain all the valuable data their organization generates, transform it into useful metrics, and make it accessible to analysts, data scientists, internal business users and customers.
According to a global survey sponsored by Fivetran and conducted by Dimensional Research, most data engineers lack the time and technological resources they need to meet that goal. For many organizations, informed decision-making remains out of reach. Two findings from the survey stand out:
- Nearly 50% of data engineers say valuable organizational data is not centralized for analysis
- Nearly 70% say they don’t have enough time to extract maximum value from existing data
The problem is data integration.
Data pipelines are hard to build and easy to break
Dimensional Research found that a substantial majority of data engineers use script-based tools to move data into a central repository, and many still rely on spreadsheets. The problem with these and other common integration technologies is twofold: they're difficult to build, and they break easily.
10+ days
Time it takes most engineers to build a script-based ETL solution
The survey found that nearly 70% of script-based solutions, and nearly half of spreadsheet-based solutions, took more than 10 days to build. Despite that time investment, most pipelines are not robust: they are both failure-prone and difficult to repair. 51% of the engineers surveyed said their pipelines break daily, weekly or monthly, and well over half said it took more than one business day to repair a broken pipeline.
51% of data engineers
Say their data pipelines break daily, weekly or monthly
The problems are aggravated by the sheer number of data sources modern businesses use, and the frequency with which data from those sources needs to be moved. The survey found that 59% of companies use 11 or more data sources, nearly a quarter use more than 50, and 72% need to move data from sources to a destination more than once a day.
> 1 business day
Time it takes most engineers to repair a pipeline break
These ETL inefficiencies explain the two topline statistics we called out above. Building pipelines is so time-consuming that many data sources go unintegrated, and data engineers can’t adequately transform and model the data they have because they’re spending so much time on ETL.
Automated processes solve ETL inefficiencies
Automated data integration technology is designed to address inefficient pipeline construction and maintenance, while accelerating the process of data transformation and modeling. Here’s how it solves these core data engineering challenges.
Prebuilt pipelines that launch on demand
Automated data integration provides prebuilt data connectors for specific data sources, eliminating the burden of pipeline construction for data engineers. Vendors can carefully study the APIs and databases of individual data sources, and then build connectors that automatically normalize data and load it into analysis-ready schemas.
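Conceptually, a prebuilt connector encapsulates the normalize-and-load steps an engineer would otherwise script by hand. The sketch below is illustrative only: the field names, table name and in-memory "warehouse" are hypothetical, not any vendor's actual code.

```python
# Minimal sketch of what a prebuilt connector automates: pull raw records,
# normalize them into an analysis-ready schema, and load them into a destination.
# All source fields and table names here are hypothetical.

from datetime import datetime

RAW_API_RECORDS = [
    {"id": "ord_1", "amount_cents": 1250, "created": "2021-03-02T08:15:00Z"},
    {"id": "ord_2", "amount_cents": 990,  "created": "2021-03-02T09:40:00Z"},
]

def normalize(record: dict) -> dict:
    """Map a raw API payload onto a typed, analysis-ready row."""
    return {
        "order_id": record["id"],
        "amount_usd": record["amount_cents"] / 100,
        "created_at": datetime.fromisoformat(record["created"].replace("Z", "+00:00")),
    }

def load(rows: list, destination: list) -> None:
    """Stand-in for writing rows to a warehouse table."""
    destination.extend(rows)

warehouse_orders: list = []
load([normalize(r) for r in RAW_API_RECORDS], warehouse_orders)
print(warehouse_orders)
```

In a scripted pipeline, every one of those mappings has to be written, tested and maintained by the engineering team; with automated connectors, that work ships with the product.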
50% of companies
Need more than a business week to build a data pipeline
Maintenance-free data integration
In the Dimensional Research survey, engineers said that the top two causes of pipeline breaks were schema changes and source availability issues (connectivity, uptime, etc.). Automated data connectors can preclude those failures by detecting and responding to schema and API changes automatically. In the event of failure, they intelligently restart from the last point of success in the data delivery process.
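As a rough illustration of that recovery behavior (a minimal sketch, not Fivetran's implementation; the source rows, column names and batch size are invented), an incremental sync might persist a cursor after each successful batch and widen the destination schema when a new column appears:

```python
# Sketch of checkpointed, schema-aware incremental sync.
# A failed run resumes from the last saved cursor, and any new source
# column is added to the destination before rows are loaded.

SOURCE_ROWS = [
    {"id": 1, "status": "open"},
    {"id": 2, "status": "closed"},
    {"id": 3, "status": "open", "priority": "high"},  # new column appears mid-stream
]

destination_rows: list = []
destination_columns: set = {"id", "status"}
checkpoint = {"last_id": 0}  # persisted between runs in a real connector

def sync(batch_size: int = 2) -> None:
    pending = [r for r in SOURCE_ROWS if r["id"] > checkpoint["last_id"]]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for row in batch:
            # Detect and apply schema drift before loading the row.
            destination_columns.update(set(row) - destination_columns)
            destination_rows.append({col: row.get(col) for col in destination_columns})
        # Advance the cursor only after the batch lands successfully.
        checkpoint["last_id"] = batch[-1]["id"]

sync()
print(destination_columns, checkpoint)
```

Because the cursor only advances after a successful load, a connectivity failure mid-sync costs a retry rather than a rebuild, and a schema change never silently drops data.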
Nearly 60% of companies
Suffer delays in critical decision-making when pipelines break
Accelerated data transformation and modeling
Prebuilt, automated pipelines free up data engineers to focus on modeling and transforming data. Integration with in-warehouse transformation tools accelerates the process further, allowing engineers to collaborate with other data professionals and build data models in a unified, managed environment.
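To make the idea concrete, the sketch below uses SQLite as a stand-in for a cloud warehouse; the raw table and the derived model are invented, but they show transformation happening as SQL where the data already lives, after automated pipelines have landed it.

```python
# Minimal sketch of in-warehouse transformation: raw, loaded data is reshaped
# into an analysis-ready model with SQL run inside the warehouse itself.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id TEXT, customer_id TEXT, amount_usd REAL);
    INSERT INTO raw_orders VALUES
        ('ord_1', 'cust_a', 12.50),
        ('ord_2', 'cust_a', 9.90),
        ('ord_3', 'cust_b', 30.00);

    -- A simple derived model: revenue per customer, built where the data lives.
    CREATE TABLE customer_revenue AS
    SELECT customer_id, SUM(amount_usd) AS total_revenue
    FROM raw_orders
    GROUP BY customer_id;
""");

print(conn.execute("SELECT * FROM customer_revenue ORDER BY customer_id").fetchall())
```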
Engineers turn to innovation, customization and optimization
Replacing inefficient ETL processes gives data engineers much more time to focus on strategic projects, without decreasing their value in the market. There has been a shortage of data engineering talent since at least 2016, and demand for the role continues to grow. Businesses will continue to need data engineers — but not necessarily for ETL.
79% of companies
Plan to hire data engineers this year
Tristan Handy, CEO of Fishtown Analytics, creator of dbt, has observed the rise of automated data integration firsthand. Handy believes that data engineers are still a “critical part of any high-functioning data team,” as he wrote in a recent blog post. He went on to outline four main roles for data engineers who no longer have to worry about ETL:
- Managing and optimizing core data infrastructure
- Building and maintaining custom ingestion pipelines
- Supporting data team resources with design and performance optimization
- Building non-SQL transformation pipelines
At Fivetran, we’ve watched this scenario play out many times. When organizations adopt automated data integration, data engineering teams refocus on innovation, customization and optimization, with impressive results. Here are a few examples:
- Sendbird. The members of the data engineering team each saved 20 hours per month, and reinvested much of that time in building a data lake. They also began re-engineering product features to increase sales and customer retention.
- Square. Data engineers turned their focus from building and maintaining pipelines to building infrastructure for new products, including first-party solutions for business needs.
- Ignition Group. Freed of ETL tasks, the data warehouse team could more easily partner with other business units and help the company’s analysts by writing dimension views for its Snowflake warehouse.