In today’s landscape, making smarter business decisions faster gives you a critical competitive advantage. But harnessing timely insights from your company’s data can seem like a headache-inducing challenge.
The volume of data — and data sources — continues to grow every day: on-premise solutions, SaaS applications, databases and other external data sources.
How do you bring the data from these disparate sources together? How do you process and transform this data so that it’s ready for analytics?
The answer is data pipelines.
In this article, we’ll look at what a data pipeline is, how it works, and how it differs from ETL pipelines. We’ll also explain the main data pipeline architectures and provide examples for each so that you can choose one that fits your data needs.
Table of contents
- What is a data pipeline?
- Data pipeline architecture
- Types of data pipeline architectures
- Data pipeline vs. ETL
- ETL and data pipeline reliability
- Automated data connectors
- Get started with a fully managed ELT data pipeline solution
What is a data pipeline?
As data moves from source to target systems, data pipelines include an additional step that transforms this data to make it ready for analytics. This facilitates a range of use cases, from creating sales forecasts to generating product recommendations and more.
Data pipelines consist of three components: a data source, a transformation step and a target destination.
Here’s a look at each data pipeline component:
- Source: A data source might include an internal database such as a production transactional database powered by MongoDB or PostgreSQL; a cloud platform such as Salesforce, Shopify or Mailchimp; or an external data source such as Nielsen or Qualtrics.
- Transformation: Data transformation can be performed using tools such as dbt or Trifacta, or can be built manually using a mix of technologies such as Python, Apache Airflow and similar tools. These tools are mostly used to make data from external sources relevant to each unique business use case.
- Destination: Destinations are the repositories in which your data is stored once extracted, such as data warehouses or data lakes.
Data pipelines centralize data from disparate sources into one place for analysis. You can get a more robust view of your customers, create consolidated financial dashboards and more.
For example, a company’s marketing and commerce stack might include separate platforms, such as Facebook Ads, Google Analytics and Shopify. If a customer experience analyst wants to make sense of these data points to understand the effectiveness of an ad, they’ll need a data pipeline to manage the transfer and normalization of data from these sources into a data warehouse such as Snowflake.
Additionally, data pipelines can feed data from a data warehouse or data lake into operational systems, such as a customer experience processing system like Qualtrics.
Data pipelines can also ensure consistent data quality, which is critical for reliable business intelligence.
Data pipeline architecture
A data pipeline architecture is the underlying system of a data pipeline — it’s responsible for capturing, transforming and routing raw data to destination systems.
Many companies are modernizing their data infrastructure by adopting cloud-native tools. Automated data pipelines are a key component of this modern data stack and enable businesses to embrace new data sources and improve business intelligence.
The modern data stack consists of:
- An automated data pipeline tool such as Fivetran
- A cloud data destination such as Snowflake, Databricks, Google BigQuery or Amazon Redshift
- A post-load transformation tool such as dbt (also known as data build tool, by dbt Labs)
- A business intelligence engine such as Looker, Chartio or Tableau
Data pipelines enable the transfer of data from a source platform to a destination, where analysts and data scientists can query and turn the data into actionable insights.
Consider the case of running shoe manufacturer ASICS. The company needed to integrate data from NetSuite and Salesforce Marketing Cloud into Snowflake to gain a 360° view of its customers.
To do so, the ASICS data team looked at its core application data — in this case, from the popular app Runkeeper — and combined data on signups for loyalty programs with data from other attribution channels. With a data pipeline, ASICS was able to scale its data integration easily.
There are many variations to the workflow above, depending on the business use case and the destination of choice.
Here are the basic steps of a data transfer.
1. Reading from a source
Sources can include production databases such as MySQL, MongoDB and PostgreSQL and web applications such as Salesforce and Mailchimp. A data pipeline reads from the API endpoint at scheduled intervals.
2. Defining a destination
Destinations might include a cloud data warehouse (Snowflake, Databricks Lakehouse, BigQuery or Redshift), a data lake or a business intelligence/dashboarding engine.
3. Transforming data
Data professionals need structured and accessible data that can be interpreted so that it makes sense to their business partners. Data transformation enables practitioners to alter and format data to make it relevant to their specific use case.
Data transformation can take the following shapes:
- Revising: Replacing incompatible characters, removing duplicate records, excluding incomplete data or removing unused columns
- Computing: Performing cross-column calculations, calculating averages and max values, turning row values into columns or sorting records
- Separating: Splitting a column into multiple columns or excluding data based on certain row values or columns
- Combining: Joining data across different tables or merging records from multiple tables
Transformations make data well-formed and well-organized — easy for humans and applications to interpret. A data analyst may use a tool such as dbt to standardize, sort, validate and verify the data brought in from the pipeline.
4. Storing data
The last step of a data transfer involves storing the data in a central repository — either in on-premise servers or in the cloud.
Two common types of datastores are:
- Data warehouses: Enterprise systems that can store structured data from different sources. They support use cases like data visualization and batch reporting.
- Data lakes: Repositories that can store structured and unstructured data. They support use cases for machine learning and predictive analytics (using historical data to predict future outcomes).
The datastore that you choose will depend on the data you’re storing and its intended business use case. Learn about the differences between data lakes and data warehouses here.
Types of data pipeline architectures
There are three types of data pipeline architectures: batch processing, stream processing and hybrid processing.
Here’s an overview of each.
Batch processing collects and processes large data sets in batches. It’s often used when companies don’t need to perform data analysis right away. Examples include processing payroll and creating weekly sales reports. Batching can be done in intervals ranging from minutes or hours to days or weeks.
Batch processing places a lot of load on data sources, so it’s often done outside normal business hours or when user activity is low. This data pipeline architecture is most commonly used with Extract, Transform, Load (ETL):
- Extract: Pulling data from a source
- Transform: Processing the data and making it usable
- Load: Moving the data to a repository (like a data warehouse)
Streaming data pipelines continuously move and transform data in real time right from the moment the data is created. This data comes from sources like Internet of Things (IoT) devices, applications, sensors, social media feeds and website logs.
Ride-sharing apps like Uber use stream processing pipelines to instantly calculate fares and connect riders to nearby drivers.
This data pipeline architecture is ideal for Extract, Load, Transform (ELT) processes, which offer more flexible data formats and fast data availability.
The Lambda architecture is a hybrid data pipeline architecture that combines batch and stream processing. It consists of three layers: a batch layer, a serving layer and a speed layer (also known as a “stream-processing layer”).
This data pipeline architecture is popular for big data applications, as it allows for both real-time streaming and historical batch analysis. However, it’s extremely complex to build. Administrators have to maintain two separate code bases and keep them in sync to ensure accurate results.
Data pipeline vs. ETL
To quickly recap, a data pipeline describes the process of moving data from one system to another. An ETL pipeline is a specific type of data pipeline. While the terms “data pipeline” and “ETL pipeline” are related, they’re not synonymous.
An ETL pipeline extracts data from a source, transforms it to make it useful and loads it into a destination. A common use case is moving data into a data warehouse for analytics.
There are three key differences between data pipelines and ETL pipelines.
1. Data pipelines don’t always involve data transformation
ETL pipelines always transform data before loading it into the target source. This isn’t always the case for data pipelines, as they can transform data after it loads into a destination (or not at all).
2. Data pipelines don’t always run in batches
Data pipelines often perform real-time data processing, which supports analytics and reporting. In contrast, ETL pipelines move data to a target system in pre-determined batches.
3. Data pipelines don’t always finish after loading data
ETL pipelines end after loading data into a target system. However, data pipelines can initiate other processes even after they finish loading data.
ETL pipelines seem like the obvious choice in terms of building a robust data pipeline architecture, as they allow you to transform and run your data in batches. However, they’re not always reliable, as we’ll see in the next section.
ETL and data pipeline reliability
As with anything in the technology world, things break, data flows included.
When your data analytics and business intelligence operations rely on data extracted from various sources, you want your data pipelines to be fast and reliable. But when you’re ingesting external sources such as Stripe, Salesforce or Shopify, API changes may result in deleted fields and broken data flows.
Moreover, building a data pipeline is often beyond the technical capabilities (or desires) of analysts. It typically necessitates the close involvement of IT and engineering talent, along with bespoke code to extract and transform each source of data.
Data pipelines demand maintenance and attention in a manner akin to leaking pipes — pipes down which companies pour money — with little to show in return. And don’t even think about the complexity of building an idempotent data pipeline.
With the rapid growth of cloud-based options and the plummeting cost of cloud-based computation and storage, there is little reason to continue this practice. Today, it’s possible to maintain massive amounts of data in the cloud at a low cost and use a SaaS data pipeline tool to improve and simplify your data analytics.
Automated data connectors
Your data engineering team can undoubtedly build connectors to extract data from a variety of platforms. Cost varies across regions and salary scales, but you can make some quick calculations and decide if the effort and risk are worth it.
Data engineers would rather focus on higher-level projects than moving data from point A to point B, not to mention maintaining those “leaky pipes” mentioned above.
Compare the effort of manually building connectors to an automated data pipeline tool. This kind of tool monitors data sources for changes of any kind and can automatically adjust the data integration process without involving developers.
This is why automated data connectors are the most effective way to reduce the programmer burden and enable data analysts and data scientists.
And with data transfer (or data pipelining) handled, data engineers are free to play a more valuable, interesting role: to catalog data for internal stakeholders and be the bridge between analysis and data science.
Get started with a fully managed ELT data pipeline solution
Fivetran automated data connectors are prebuilt and preconfigured and support 150+ data sources, including databases, cloud services and applications. Fivetran connectors automatically adapt as vendors make changes to schemas by adding or removing columns, changing a data element’s types, or adding new tables.
Lastly, our pipelines manage normalization and create ready-to-query data assets for your enterprise that are fault-tolerant and auto-recovering in case of failure.
Start a 14-day free trial today to get started with a fully managed ELT solution — access hundreds of data connectors, centralize volumes of data into any destination and speed up your analytics.
CTA: Learn why more companies are moving towards ELT
Button: Watch the webinar now