What is ETL?
The acronym “ETL” stands for extract, transform, load. It is often used to refer to data integration: the activities required to support analytics. These activities include:
- Gathering and extracting data
- Loading it into a destination
- Transforming it into models that analysts can use
The precise order in which those activities are performed can vary. In this piece, we will describe two major conceptual approaches to data integration, ETL and ELT.
The traditional approach to data integration, Extract-Transform-Load (ETL), dates from the 1970s and is so ubiquitous that “ETL” is often used interchangeably with data integration. Under ETL, data pipelines extract data from sources, transform data into data models for analysts to turn into reports and dashboards, and then load data into a data warehouse.
Data transformations typically aggregate or summarize data, shrinking its overall volume. When ETL was first devised in the 1970s, most organizations operated under very stringent technological constraints. Storage, computation, and bandwidth were extremely scarce. By transforming before loading, ETL limits the volume of data that is warehoused. That preserves storage, computation, and bandwidth resources throughout the entire workflow.
The project workflow for ETL consists of the following steps:
- Identify desired data sources
- Scope the exact data-analytics needs the project is meant to solve
- Define the data model/schema that the analysts and other end users need
- Build the pipeline, consisting of extraction, transformation and loading functions
- Conduct analytics work and extract insights
Since, in ETL, extraction and transformation are both performed before any data is loaded to a destination, they are tightly coupled. Moreover, because transformations are dictated by the specific needs of analysts, every ETL pipeline is a complicated, custom-built solution.
The bespoke nature of these pipelines makes scaling very difficult, particularly adding data sources and data models.
There are two common conditions under which this workflow must be repeated:
- Upstream schemas change, invalidating the code used to transform the raw data into the desired data models. This happens when fields are added, deleted, or changed at the source.
- Downstream analytics needs change, requiring the transformation code to be rewritten to produce new data models. This usually happens when an analyst wants to build a dashboard or report that requires data in a configuration that doesn’t exist yet.
Any organization that is constantly improving its data literacy will regularly encounter these two conditions.
The close coupling between extraction and transformation means that transformation stoppages also prevent data from being loaded to the destination, creating downtime.
Using ETL tools for data integration therefore involves the following challenges:
- Constant maintenance – Since the data pipeline both extracts and transforms data, the moment upstream schemas change or downstream data models must be changed, the pipeline breaks and an often extensive revision of the ETL software’s code base is required.
- Customization and complexity – Data pipelines not only extract data but perform sophisticated transformations tailored to the specific analytics needs of the end users. This means a great deal of custom code.
- Labor-intensiveness and expense – Because the system runs on a bespoke code base, it requires a team of dedicated data engineers to build and maintain.
These challenges result from the key tradeoff made under ETL, which is to conserve computation and storage resources at the expense of labor.
Technology trends toward cloud data integration
Labor intensiveness was acceptable at a time when computation, storage, and bandwidth were extremely scarce and expensive, and the volume and variety of data were limited. ETL is a product of a time with very technological constraints from what currently prevail.
Specifically, the cost of storage has plummeted from nearly $1 million to a matter of cents per gigabyte (a factor of 50 million) over the course of four decades. Likewise, the cost of computation has shrunken by a factor of millions since the 1970s, and the cost of internet transit has fallen by a factor of thousands.
These trends have made ETL obsolete for most purposes in two ways. First, the affordability of computation, storage, and internet bandwidth has led to the explosive growth of the cloud and cloud-based services. As the cloud has grown, the volume, variety, and complexity of data have grown as well. A brittle, bespoke pipeline that integrates a limited volume and granularity of data is no longer sufficient.
Secondly, modern data integration technologies suffer fewer constraints on the volume of data to be stored and on the frequency of queries performed within a warehouse. The affordability of computation, storage, and internet bandwidth has made it practical to reorder the workflow needed to integrate data. Most importantly, organizations can now afford to store untransformed data in data warehouses.
What is ELT?
The ability to store untransformed data in data warehouses enables a new data integration architecture, Extract-Load-Transform (ELT), in which data is immediately loaded to a destination upon extraction, and the transformation step is moved to the end of the workflow.
This prevents the two failure states of ETL (i.e. changing upstream schemas and downstream data models) from impacting extraction and loading, leading to a simpler and more robust approach to data integration.
In contrast to ETL, the ELT workflow features a shorter cycle:
- Identify desired data sources
- Perform automated extraction and loading
- Scope the exact analytics needs the project is meant to solve
- Create data models by building transformations
- Conduct actual analytics work and extract insights
Under ELT, extracting and loading data are independent of transformation by virtue of being upstream of it. Although the transformation layer may still fail as upstream schemas or downstream data models change, these failures will not prevent data from being loaded into a destination.
Instead, even as transformations are periodically rewritten by analysts, an organization can continue to extract and load data. Since this data arrives at its destination with minimal alteration, it serves as a comprehensive, up-to-date source of truth.
Most importantly, the decoupling of extraction and loading from transformation means that the output of extraction and loading no longer must be customized. The destination can be populated with data directly from the source, with no more than light cleaning and normalization to ensure data quality and ease of use for analysts. Combined with the growth of the cloud, this means that extraction and loading can be:
- Outsourced to an outside party
- Scaled up and down as needed over the cloud
Automated extraction and loading produces a standardized output, allowing derivative products such as templated analytics products to be produced and layered on top of the destination.
Moreover, since transformations are performed within the data warehouse environment, there is no longer any need to design transformations through drag-and-drop transformation interfaces, write transformations using scripting languages such as Python, or build complex orchestrations between disparate data sources.
Instead, transformations can be written in SQL, the native language of most analysts. This shifts data integration from an IT- or engineer-centric activity to one that can be directly and easily owned by analysts.
Differences between ETL and ELT
The following table summarizes the differences between ETL and ELT:
There are some cases where ETL may still be preferable over ELT. These specifically include cases where:
- The desired data models are well-known and unlikely to change quickly. This is especially the case when an organization also builds and maintains systems that generate source data.
- There are stringent security and regulatory compliance requirements concerning the data, and it absolutely cannot be stored in any location that might be compromised.
These conditions tend to be characteristic of very large enterprises and organizations that specialize in software-as-a-service products. In such cases, it may make sense to use ELT for data integration with third-party SaaS products while retaining ETL to integrate in-house, proprietary data sources.
Benefits of ELT
An organization that combines automation with ELT stands to dramatically simplify its data integration workflow. A simplified data integration solution acts as a force multiplier to data engineering, enabling data engineers to focus on mission-critical projects such as optimizing an organization’s data infrastructure or productionizing predictive models, instead of constructing and maintaining data pipelines.
Analysts and data scientists, whose responsibilities often consist more of data integration activities than actual analytics, can finally leverage their understanding of business needs toward modeling and analyzing data instead of wrangling or munging it, or asking developers to wrange or munge on their behalf.