Both data pipelines and ETL are responsible for transferring data between sources and storage solutions, but they do so in different ways. Data pipelines work with ongoing data streams in real time, while ETL focuses more on individual “batches” of data for more specific purposes. That’s not to say the two are entirely separate entities, however.
In this article, we’ll compare and contrast data pipeline vs. ETL, review real-world examples and explore areas where the two are related.
What is ETL?
Extract, transform, load (ETL) is one of the most common processes for preparing raw data for storage and later use. By “later use,” we mean any application that might consume the data — analytics and business intelligence, just to name a few.
Each part of the ETL process is a discrete step necessary for converting raw data into a useful format stored in a convenient location. True to the name, these steps involve extracting, transforming and loading data from a variety of data sources.
The ETL process
The ETL process does exactly what its name suggests.
Let’s look at each of these operations in closer detail.
- Extract: If you’re like most organizations, you probably pull data from a variety of external or on-premises data sources. Depending on your goals and applications, these sources might include existing data lakes or warehouses, sales and marketing applications, CRM systems, existing databases or even sensors and social media feeds.
Regardless of your sources (or even if you only have a single source), the ultimate goal of ETL is to put all your data in a single repository. The Extract process includes all of the individual extraction processes for each source, whether they be a full or partial extraction of certain data.
Extractions are almost always performed with an ETL service or ETL tools rather than hand-coded solutions. These tools are also useful when performing validations on extracted data, such as removing duplicates or checking data types.
- Transform: Just like crude oil, raw data is rarely useful on its own. To make it useful, we first need to transform it to remove “impurities” and put it in a usable format.
There’s no single way to do this: how you transform your data is entirely dependent on how you plan to use it. However, the process typically involves cleansing, mapping and ultimately transforming data using custom operations.
Data transformation is especially useful for ensuring data integrity.
For example, raw data might have integrity problems such as different spellings for the same person or place (e.g., “John” vs. “Jon”), multiple keys or account numbers for the same entity and so on. Validation processes help to fix these integrity problems by filtering and cleaning data according to set rules and conventions.
At the end of this process, raw data will have been transformed into a “clean” data set with consistent formatting, naming conventions and so on. Note that transformations aren’t always necessary, especially if your “raw” data is already suitable for your needs. Such data is often referred to as pass-through data, as it literally passes through the transformation stage.
- Load: Finally, we can safely load our squeaky-clean data into a destination repository. For most organizations, this is typically a data warehouse which might connect to individual data marts that deliver specific subsets of the data to specific applications (e.g., financial data for financial analysis, etc.).
However, the loading process isn’t always as straightforward as simply “moving” data from one place to another.
Since any real-time process requires large volumes of data to be loaded as quickly as possible, it’s crucial to optimize loading to maintain high performance. Beyond the initial load that populates all the data warehouse tables for the first time, organizations might use change data capture (CDC) for incremental loads that only load/update data that’s recently changed.
Loading also requires its own validation, such as checking that all key fields are populated. In the event of a load failure, the warehouse should be restored to the most recent stable point – which is all the more reason to maintain regular backups.
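The three steps above can be sketched in miniature. The following is an illustrative Python sketch over in-memory records, not any real ETL tool’s API; the field names, the “Jon” → “John” mapping rule and the plain list standing in for a warehouse are all assumptions:

```python
# Minimal extract-transform-load sketch over in-memory records.
# Sources, fields and cleansing rules here are illustrative only.

def extract(*sources):
    """Pull records from every source into one place (a full extraction)."""
    records = []
    for source in sources:
        records.extend(source)
    return records

def transform(records):
    """Cleanse and standardize: normalize names, then drop duplicates."""
    seen = set()
    clean = []
    for rec in records:
        # Map variant spellings to one canonical form (e.g., "Jon" -> "John").
        name = {"Jon": "John"}.get(rec["name"], rec["name"])
        key = (rec["account_id"], name)
        if key in seen:          # duplicate-removal validation
            continue
        seen.add(key)
        clean.append({"account_id": int(rec["account_id"]), "name": name})
    return clean

def load(records, warehouse):
    """Append validated rows; fail if a key field is missing."""
    for rec in records:
        if rec.get("account_id") is None:
            raise ValueError("load failed: missing key field")
        warehouse.append(rec)
    return warehouse

# Two hypothetical sources holding the same entity under different spellings.
crm = [{"account_id": "1", "name": "John"}]
sales = [{"account_id": "1", "name": "Jon"}, {"account_id": "2", "name": "Ana"}]
warehouse = load(transform(extract(crm, sales)), [])
```

Note how the duplicate “Jon” record is collapsed into the canonical “John” before loading — the warehouse only ever sees the cleaned, consistently formatted data set.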
ETL vs. ELT
You’ve probably also heard of extract, load, transform (ELT), but what’s the difference?
While “ETL” and “ELT” are sometimes used interchangeably, this isn’t correct. Where ETL transforms before loading, ELT loads before transforming (hence the swap of “L” and “T”). It may seem like a small difference, but it’s actually a fairly large distinction.
In ETL, transforming the data before loading implies that you already know how you intend to use it. In other words, if you can expect your applications and business processes to always use the same formats, then it’s usually safe to transform the data to the desired formats before loading it into your data warehouse.
In ELT, we make the opposite assumption. By loading the data into a data warehouse without transforming first, you’re providing flexibility for other applications to come in and transform it as they please.
Though that may sound disorganized, it’s quickly becoming the preference for large-scale predictive analytics that may use a variety of methods — and, as a result, a variety of formats — when analyzing data “on the fly.”
Extracting and loading first has also allowed modern data pipelines to focus solely on extract and load when gathering data from numerous sources. Thankfully, data integration platforms like Fivetran provide these features at a modular level, allowing data managers to place them anywhere in their ETL pipelines.
So what does all this mean in a nutshell? Keep these key points in mind as we start to explore more generic data pipelines in the next section.
- The ETL process (sometimes known as an ETL pipeline) is only responsible for extracting, loading and transforming data
- ETL works in batch processing, meaning that large “batches” of data (rather than a continuous stream) undergo ETL at a time
- ETL ends after loading data
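The difference in step ordering can be made concrete with a short sketch. Everything here — the sample rows, the trimming/lowercasing rule and the lists standing in for warehouses — is hypothetical:

```python
# Contrast of step ordering: ETL transforms before load, ELT after.
# The raw rows, the normalization rule and the list "warehouses" are illustrative.

def extract():
    return [" alice ", "BOB", "alice"]

def transform(rows):
    # One fixed, pre-agreed format: trimmed, lowercased, de-duplicated.
    return sorted({r.strip().lower() for r in rows})

def load(rows, warehouse):
    warehouse.extend(rows)
    return warehouse

# ETL: the warehouse only ever sees the agreed-upon format.
etl_warehouse = load(transform(extract()), [])

# ELT: the warehouse stores raw rows; each consumer transforms as needed.
elt_warehouse = load(extract(), [])
report_view = transform(elt_warehouse)   # one consumer's on-the-fly transform
```

In the ELT case, a second consumer could apply a completely different transform to the same raw rows — exactly the flexibility described above.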
What is a data pipeline?
Now that we understand how the ETL process works, let’s explore a more general infrastructure for data integration: the data pipeline.
In essence, a data pipeline is a combination of the disparate sources, warehouse solutions, processes and application components that make up an organization’s data analytics infrastructure. In other words, it’s the literal pipeline through which data flows from source to destination.
While not every data pipeline is the same, almost all of them share the same major components.
Data pipeline components
Data pipelines consist of everything an organization needs to gather, store and process data for their needs. All of these components are divided into three major categories — sources, processing and destinations — which data encounters throughout the pipeline’s data flow.
- Sources: Just like how oil pipelines start at oil sources, data pipelines start at data sources such as existing data warehouses or data lakes. While most pipelines have multiple data sources feeding into a single point, some pipeline architectures will assign different layers to different sources based on their priority (such as the “speed layer” found in lambda architecture).
- Processing: Once in the pipeline, data might be processed in multiple ways before arriving at its destination(s). Here we’ll often find the familiar ETL process, where each step might occur in a different order depending on whether the pipeline is predominantly ETL- or ELT-based.
- Destinations: Finally, data is sent to a final destination, such as data visualization or analytical tools. Though a data warehouse (sometimes known as a data sink in a data pipeline context) is usually the immediate destination after loading is complete, the pipeline ultimately extends to the individual applications for the data.
Of course, there’s much more to data pipelines than the three stages outlined above. Throughout the pipeline, data might also encounter various workflows and warehouse solutions depending on its ultimate use.
- Workflows: Many key functions are organized into workflows consisting of individual tasks known as jobs. Jobs flow from upstream (nearer the source) to downstream (nearer the destination).
- Storage: Intermediate data storage solutions that might include data warehouses, data lakes or data marts (or a combination).
- Data Tools: Many organizations use data processing tools throughout their pipelines, especially where ETL processing is concerned.
- Analytics Tools: Most data pipelines “end” at analytics tools, where cleansed and formatted data are used to deliver actionable insights and data visualizations.
- Monitoring: With so many stages and components, data pipelines can’t exist as “black boxes” where only the source and destination are visible. As a result, most reliable data pipelines use some type of monitoring solution to track data at every point throughout the data flow.
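Putting these components together, a toy pipeline with per-stage monitoring might look like the sketch below. The stage names, the metrics counter and the validation rule are illustrative assumptions, not any specific tool’s API:

```python
# Toy pipeline: sources -> processing -> destination, with per-stage monitoring.
from collections import Counter

metrics = Counter()          # monitoring: row counts at every stage

def run_pipeline(sources, process, destination):
    for source in sources:
        for row in source:
            metrics["extracted"] += 1
            row = process(row)
            if row is None:              # dropped by a validation rule
                metrics["rejected"] += 1
                continue
            metrics["loaded"] += 1
            destination.append(row)
    return destination

# Two hypothetical feeds; negative sensor readings are rejected in processing.
sensor_feed = [3, -1, 7]
crm_feed = [5]
dest = run_pipeline(
    [sensor_feed, crm_feed],
    process=lambda x: x if x >= 0 else None,
    destination=[],
)
```

Because every row increments a counter as it passes each stage, the pipeline is never a “black box”: the metrics reveal exactly how many rows were extracted, rejected and loaded.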
The role of ETL in the data pipeline
ETL is an essential part of many data pipelines.
In fact, many data pipelines are organized around the ETL process, creating an ETL data pipeline like the one shown above. Though ETL is a batch process, most ETL pipelines can support real-time data analysis, becoming what’s known as a streaming data pipeline.
Before we move into our head-to-head comparison, let’s review a few key points:
- Data pipelines are responsible for data at every point from source to destination.
- Data pipelines operate continuously, often in real time.
- Data pipelines don’t end after loading data.
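One common way an ETL-based pipeline approximates this continuous, real-time behavior is by running ETL repeatedly over small “micro-batches” as data arrives. The sketch below assumes a simple iterable stream, a fixed batch size and a stand-in transform — all illustrative:

```python
# A streaming pipeline approximated as ETL over continuous micro-batches:
# each small batch is extracted, transformed and loaded as it arrives.

def micro_batches(stream, size):
    """Group an incoming stream into fixed-size batches."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch          # final partial batch

def run_streaming_etl(stream, warehouse, size=2):
    for batch in micro_batches(stream, size):   # "continuous" batching
        transformed = [x * 2 for x in batch]    # stand-in transform step
        warehouse.extend(transformed)           # incremental load
    return warehouse

warehouse = run_streaming_etl(iter([1, 2, 3, 4, 5]), [])
```

The smaller the batch, the closer the pipeline gets to true real-time movement — at the cost of more frequent load operations.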
Data pipeline vs. ETL
Data pipelines and ETL are frequently confused with one another, and for good reason: ETL is simply a type of data pipeline!
However, that doesn’t mean ETL shares all the characteristics of a “general” data pipeline or that ETL alone can support real-time applications. Keep these key differences in mind as you build your data integrations.
1. Data pipelines occur in real time; ETL occurs in batches
Though ETL is a pipeline in itself, it only operates in discrete batches. To support real-time reporting and analytics, a more “complete” data pipeline solution is necessary, one that runs frequent, continuous batches (using ETL or ELT) to approximate real-time data movement.
2. Data pipelines don’t always transform data
As we mentioned earlier, many big data applications defer transformation steps until they’re needed.
Because of this, not every data pipeline is responsible for transforming data, whereas transformation must occur in ETL before loading into a data warehouse. Here, this difference often comes down to whether the pipeline is built around ELT or ETL.
3. ETL stops after loading data; data pipelines are continuous
ETL is only concerned with extracting, transforming and loading data — and nothing else. Even in an ETL pipeline, ETL is still a separate, discrete component that ends after loading. Once ETL is complete, data loaded into a data repository can continue throughout the rest of the pipeline.
In summary: data pipelines are continuous structures that might use ETL as a supporting process for extracting, transforming and loading data in frequent batches.
Get the best of both with Fivetran
Thankfully, you don’t need to choose between ETL and data pipelines. Not only can you build data pipelines around ETL, but you can also use data integration tools like Fivetran to create custom pipelines tailor-made to your needs and applications.
With powerful extract and load tools and a complete range of transformations, Fivetran can support ETL everywhere throughout your pipeline. Our even wider range of connectors, replication tools and other supporting features allow you to quickly and easily create reliable data pipelines for your data stack.