Data pipeline architecture: A complete guide
Introduction
Poor data pipeline architecture can lead to operational failures during routine events, such as API updates, schema drift from source systems, or fluctuating data volumes. These predictable variances trigger cascading job failures, requiring manual intervention from the data team. This instability consumes engineering resources that should go toward development rather than repairs.
This flawed architecture hinders business adaptation. Data teams struggle to integrate new sources or support analytics without major re-engineering, leading to delayed insights, missed SLAs, and a lack of data-driven confidence.
This guide provides a strategic framework for designing and building reliable data pipelines. We cover the core principles of any scalable data architecture, then detail the 3 primary patterns for modern data movement: batch ELT, stream processing, and change data capture.
Core principles of modern data pipeline architecture
Four principles define a modern data pipeline architecture:
- Scalability
- Reliability
- Modularity
- Observability
They are foundational design choices that determine a pipeline's ability to deliver accurate data on schedule, adapt to new business requirements, and operate without constant manual intervention. A data pipeline that neglects these principles is operationally expensive from its first execution.
Scalability
A scalable data pipeline maintains performance as data volume, velocity, and variety increase. The central architectural challenge is to design a system that handles both average and peak loads without failure or significant latency.
A data pipeline that performs well during routine operations but collapses during a sudden traffic spike is an operational liability. Scalability requires decoupling compute from storage; a traditional, coupled architecture forces you to scale both components together.
The modern approach uses a cloud data warehouse or data lakehouse where processing power scales independently of the data storage layer. This architecture can apply immense compute resources for heavy transformations and then scale to 0, all while the data remains in cost-effective storage.
Key considerations:
- Use managed, serverless components: Use cloud-native services like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage for data storage, as they scale automatically. For compute, use platforms like Snowflake, Databricks, or BigQuery that provision resources on demand.
- Design for parallel execution: Structure data ingestion and transformation jobs as a series of smaller, independent tasks that can run in parallel. Distributed computing frameworks like Apache Spark execute these tasks across a cluster of machines to process enormous datasets efficiently.
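To make the parallel-execution point concrete, here is a minimal PySpark sketch. The bucket paths, column names (event_ts, event_type), and job name are illustrative assumptions, not references to a real system; Spark splits the input into partitions and runs the transformation across whatever cluster is available.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark reads the input as many partitions and processes them in parallel
# across the cluster; the same code runs unchanged on 1 node or 100.
spark = SparkSession.builder.appName("daily_event_counts").getOrCreate()

events = spark.read.parquet("s3://example-bucket/raw/events/")  # illustrative path

daily_counts = (
    events
    .withColumn("event_date", F.to_date("event_ts"))  # assumed event timestamp column
    .groupBy("event_date", "event_type")
    .count()
)

daily_counts.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_event_counts/")
```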
Reliability
A reliable, fault-tolerant data pipeline anticipates and handles failures without data loss or corruption. The goal is to build a system that recovers from common issues like network outages, API unavailability, or malformed records, rather than preventing every error.
A pipeline that requires a complete restart after a single failed task is unreliable. Idempotency is the architectural foundation of reliability. An idempotent operation produces the same result regardless of how many times it is performed with the same input.
Designing pipeline tasks to be idempotent allows for automated recovery procedures. If a job fails midway through, an automatic retry will not create duplicate records or corrupt the final dataset.
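As a minimal illustration of idempotency, the sketch below uses SQLite's upsert syntax so that replaying the same batch leaves the target table unchanged. The orders table and its columns are hypothetical; in a warehouse, the same idea is usually expressed as a MERGE keyed on a natural or surrogate key.

```python
import sqlite3

def load_orders(conn: sqlite3.Connection, batch: list[dict]) -> None:
    """Idempotent load: replaying the same batch leaves the table unchanged."""
    conn.executemany(
        """
        INSERT INTO orders (order_id, status, amount)
        VALUES (:order_id, :status, :amount)
        ON CONFLICT(order_id) DO UPDATE SET
            status = excluded.status,
            amount = excluded.amount
        """,
        batch,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, status TEXT, amount REAL)")

batch = [{"order_id": "o-1", "status": "shipped", "amount": 42.0}]
load_orders(conn, batch)
load_orders(conn, batch)  # an automated retry of the same batch creates no duplicates

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())  # (1,)
```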
Key considerations:
- Implement automated retries with exponential backoff. For transient issues, the pipeline should not fail immediately. It should automatically retry the operation after a short delay and gradually increase the delay between subsequent retries. This strategy avoids overwhelming a recovering service.
- Use a dead-letter queue (DLQ) for invalid records. When a pipeline encounters a record it can't process, it shouldn’t halt the entire batch. Instead, it should route the problematic record to a DLQ for later inspection. This pattern isolates failures and keeps the primary data flow operating.
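A minimal sketch of both considerations, assuming a hypothetical fetch_page call that can fail transiently and a simple validation rule. Real pipelines typically lean on their platform's built-in retry and dead-letter features, but the control flow looks like this:

```python
import json
import random
import time

MAX_RETRIES = 5

def fetch_page(url: str) -> dict:
    """Placeholder for a network call that can fail with a transient ConnectionError."""
    ...  # e.g. an HTTP request to a source API

def fetch_with_backoff(url: str) -> dict:
    """Retry transient failures with exponentially increasing delays plus jitter."""
    delay = 1.0
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return fetch_page(url)
        except ConnectionError:
            if attempt == MAX_RETRIES:
                raise  # give up after the final attempt
            time.sleep(delay + random.uniform(0, 0.5))  # jitter avoids synchronized retries
            delay *= 2  # exponential backoff eases pressure on a recovering service

def validate(record: dict) -> dict:
    """Hypothetical validation rule: every record needs a primary key."""
    if "id" not in record:
        raise ValueError("record is missing its primary key")
    return record

def process_batch(records: list[dict], dlq_path: str = "dead_letter.jsonl") -> list[dict]:
    """Route unprocessable records to a dead-letter file instead of failing the batch."""
    valid = []
    with open(dlq_path, "a") as dlq:
        for record in records:
            try:
                valid.append(validate(record))
            except ValueError:
                dlq.write(json.dumps(record) + "\n")  # isolate the failure for later inspection
    return valid
```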
Modularity
A modular data pipeline is composed of independent, loosely coupled components. Monolithic pipelines are difficult to maintain because their components are tightly integrated.
A small change to a data source schema can require a complete redeployment of the entire data flow. This dependency increases risk and slows development. A modular architecture treats distinct data pipeline stages, like data ingestion and transformation, as separate, reusable services.
Each module has a single responsibility and communicates with other modules through a well-defined interface, such as a message queue or a storage layer. This separation enables teams to upgrade or replace 1 component without affecting the rest of the pipeline.
Key considerations:
- Decouple components with a message bus or data lake: An ingestion module should write raw data to a storage layer. A separate transformation module then activates when new data arrives. This event-driven workflow reduces dependencies between components.
- Use a dedicated orchestrator for workflow management: Tools like Apache Airflow or Dagster should only manage the dependencies and execution order of tasks. The business logic for transformations must remain encapsulated within distinct, modular components that can be tested and versioned independently.
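As a sketch of this "thin orchestrator" idea, the DAG below only wires tasks together; the ingestion and transformation logic lives in separate, independently tested modules. The my_pipeline package and its functions are hypothetical, and the imports assume Airflow 2.x.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical, independently versioned modules that hold the real business logic.
from my_pipeline.ingest import ingest_orders
from my_pipeline.transform import build_order_marts

with DAG(
    dag_id="orders_elt",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The DAG only declares dependencies and execution order.
    ingest = PythonOperator(task_id="ingest_orders", python_callable=ingest_orders)
    transform = PythonOperator(task_id="build_order_marts", python_callable=build_order_marts)

    ingest >> transform
```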
Observability
An observable data pipeline reveals its internal state through its external outputs. While monitoring tracks known failure modes, observability enables engineers to diagnose unknown problems.
When data is late or incorrect, an observable system provides the detailed telemetry needed to quickly identify the root cause, without new logging or redeployment. A complete observability practice requires 3 types of telemetry data: logs, metrics, and traces.
- Logs provide granular, event-level records of actions inside each component.
- Metrics provide a high-level, quantitative view of pipeline health, including the number of records processed per minute and data latency.
- Traces connect events across multiple components into a single view, so an engineer can follow a piece of data through its entire lifecycle.
Key considerations:
- Implement structured logging: All data pipeline components must emit logs in a consistent, machine-readable format, such as JSON. These logs should include context such as the task name and a unique run ID to make the data easy to search and analyze in a central platform (a minimal sketch follows this list).
- Track data lineage: Data lineage provides a complete audit trail of how data moves and transforms from its source to its destination. Lineage is essential for impact analysis. It shows which downstream dashboards will be affected by an upstream change and helps trace a data quality issue back to its origin.
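A minimal structured-logging sketch using only the Python standard library; the field names such as task and run_id are assumptions about what a central logging platform would want to search on.

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Render each log record as a single machine-readable JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "task": getattr(record, "task", None),
            "run_id": getattr(record, "run_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

run_id = str(uuid.uuid4())  # one ID per pipeline run, attached to every log line
logger.info("load step finished", extra={"task": "load_orders", "run_id": run_id})
# -> {"level": "INFO", "message": "load step finished", "task": "load_orders", "run_id": "..."}
```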
Data pipeline architecture patterns
Choosing the correct architectural pattern is the most critical design decision a data engineer makes. Each pattern is a blueprint optimized for a specific use case and its data timeliness requirements. The 3 dominant patterns are batch ELT, stream processing, and change data capture.
Pattern 1: The batch ELT pipeline
The batch ELT (Extract, Load, Transform) data pipeline is the dominant architecture for modern analytics. This pattern processes data in discrete, scheduled intervals to move it from disparate source systems into a central cloud data warehouse.
It supports business intelligence, reporting, and analytics. The architectural model separates data extraction and loading from data transformation. This separation defines the modern data stack.
Primary use case: Business intelligence and analytics
Batch ELT architecture is the standard for any analytics use case that does not require real-time data. The pattern builds a comprehensive, historical view of business operations for complex analysis that can't run against production systems. Teams use this historical view to track key performance indicators over time.
Common applications include:
- Financial reporting and forecasting. Finance teams require a complete historical record of transactions to close the books and model revenue. Batch pipelines deliver this data reliably on a daily or hourly schedule.
- Marketing attribution. Marketing teams join data from multiple platforms like Google Ads, Facebook Ads, and Salesforce to understand campaign performance. A batch ELT data pipeline centralizes this data for transformation into a single attribution model.
- Product analytics. Product teams analyze user behavior to understand feature adoption and engagement. Batch pipelines collect event data from applications and load it into a warehouse for segmentation, funnel analysis, and cohort analysis.
Architectural flow
The ELT architecture is a 3-stage process. Each stage is functionally distinct and uses the capabilities of the modern cloud data warehouse. These platforms provide cost-effective storage and highly scalable, independent compute.
- Extract
In the first stage, an automated data movement platform extracts raw data from a wide variety of data sources. These sources include SaaS application APIs, transactional databases like PostgreSQL, and files from SFTP servers. The platform extracts data with minimal manipulation to ensure that the raw record is an exact copy of the source.
- Load
In the second stage, the platform loads the raw data directly into a cloud data warehouse, such as Snowflake, BigQuery, or Databricks. This step intentionally bypasses any in-flight data transformation. Loading raw data first greatly simplifies data movement.
It isolates the complex, error-prone step of transformation from the critical path of data ingestion, making the pipeline more reliable. This method also guarantees a complete, untransformed copy of the source data remains available in the warehouse for re-processing or new analytics projects.
- Transform
In the final stage, data engineers and analysts transform the raw data into clean, reliable, and analysis-ready data models. They use the massive, scalable compute engine of the warehouse itself to execute these transformations.
For most data modeling, they use SQL-based tools like dbt to clean, join, and aggregate data into analytical structures, such as star schemas. For more complex operations, such as statistical analysis or machine learning, engineers use Python. Frameworks like Snowpark run Python code directly on the warehouse compute engine.
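As a minimal sketch of an in-warehouse transformation with Snowpark for Python, the example below assumes a RAW.ORDERS table with ORDER_DATE and AMOUNT columns already loaded by the EL stage; the connection parameters are placeholders, and the same logic is more commonly written as a SQL model in dbt.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, sum as sum_

# Connection parameters are illustrative; in practice they come from a secrets manager.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "TRANSFORM_WH",
    "database": "ANALYTICS",
}).create()

orders = session.table("RAW.ORDERS")  # assumed raw table loaded by the EL stage

daily_revenue = (
    orders
    .group_by(col("ORDER_DATE"))
    .agg(sum_(col("AMOUNT")).alias("TOTAL_REVENUE"))
)

# The aggregation executes on the warehouse's compute engine, not on the client.
daily_revenue.write.mode("overwrite").save_as_table("ANALYTICS.DAILY_REVENUE")
```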
Benefits and trade-offs
The batch ELT architecture provides significant reliability and flexibility. Decoupling transformation logic from the extraction and loading process creates a more resilient pipeline. If a transformation job fails, it does not affect data ingestion.
The raw data continues to load on schedule, and engineers debug and re-run the transformation independently. This separation gives technical teams broad access to raw data. Analysts build and refine new data models as business requirements change, while data scientists use the same raw data for ad-hoc exploration and feature engineering.
The main trade-off of this pattern is data latency. The data in the warehouse is only as current as the last successful batch run. These jobs typically execute on an hourly or daily schedule, so executive dashboards will not reflect events that happened minutes ago.
This latency makes the batch ELT architecture unsuitable for use cases that require real-time data, such as operational fraud detection or application monitoring.
Pattern 2: The stream processing data pipeline
A stream processing data pipeline processes data continuously, event by event, as it’s generated. This architecture analyzes and transforms data while it is in motion. Stream processing powers real-time applications that require an immediate, automated response to new information, with latency measured in seconds or milliseconds.
Primary use case: Real-time operational applications
Stream processing is for operational use cases where the value of data is highest in the moments after its creation. The pattern triggers immediate actions, updates live dashboards, and feeds real-time machine learning models. The goal is to influence an outcome while the event is still in progress.
Common applications include:
- Real-time fraud detection: Financial services companies analyze streams of transaction data to identify and block fraudulent payments. The system detects suspicious patterns and prevents transactions from being completed.
- Log and metric analysis: DevOps teams ingest and analyze high-volume streams of application and server logs for real-time anomaly detection. This helps them identify and respond to system failures or security threats as they happen.
- Supply chain logistics: A logistics company tracks its fleet of vehicles in real time. The system processes GPS data to reroute deliveries based on live traffic information.
Architectural flow
A stream processing architecture is a continuous, 3-stage flow. Each component is architected for high throughput and low latency to ensure data moves through the system without delay.
- Ingestion
Sources publish individual events, such as application clickstreams or database updates, to a distributed message bus like Apache Kafka or AWS Kinesis. The message bus organizes events into logical streams, known as topics. The system further divides topics into partitions to enable parallel processing by downstream consumers.
This design decouples the data producers from the data consumers. A producer publishes events without waiting for the processing application to be ready. The message bus also provides persistence, allowing consumers to replay events in the event of a failure.
- Processing
A stream processing engine, such as Apache Flink, Kafka Streams, or Spark Streaming, consumes the raw events from the message bus. The engine processes each event as it arrives. It applies stateless transformations, like masking a sensitive data field, that require no context beyond the individual event.
The engine also performs stateful processing, where it maintains memory of past events. It uses functions like tumbling windows to aggregate data in fixed, non-overlapping time intervals, such as a 1-minute summary. Sliding windows operate on continuously overlapping intervals to provide a constantly updated view of recent activity.
- Destination (Sink)
After processing, the engine sends the transformed, real-time results to a destination system, or sink. The sink is an application that acts on the insight immediately. Common destinations include real-time dashboards, alerting services, and low-latency key-value stores like Redis.
These sinks often power other applications. A processed event might update a user's profile in a database.
An e-commerce front end then queries that database to personalize the current shopping session. Another common sink is a real-time machine learning inference endpoint that scores incoming data and returns an immediate prediction.
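To tie the three stages together, here is a minimal sketch using the kafka-python client: it consumes events from an assumed transactions topic, aggregates them into 1-minute tumbling windows in memory, and prints each completed window where a real pipeline would write to a sink such as Redis or an alerting service. The topic name and event fields are assumptions; production engines like Flink handle state, late data, and exactly-once delivery for you.

```python
import json
from collections import defaultdict

from kafka import KafkaConsumer  # pip install kafka-python

WINDOW_SECONDS = 60

consumer = KafkaConsumer(
    "transactions",                                   # assumed topic name
    bootstrap_servers="localhost:9092",
    group_id="transaction-window-aggregator",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)

windows = defaultdict(float)  # window start (epoch seconds) -> running total
current_window = None

for message in consumer:
    event = message.value                             # e.g. {"ts": 1700000000, "amount": 42.5}
    window_start = int(event["ts"]) // WINDOW_SECONDS * WINDOW_SECONDS

    if current_window is None:
        current_window = window_start

    if window_start > current_window:
        # The previous tumbling window is complete: emit it to the sink (stdout here).
        print({"window_start": current_window, "total_amount": windows.pop(current_window, 0.0)})
        current_window = window_start

    if window_start >= current_window:
        windows[window_start] += event["amount"]
    # Late events for an already-emitted window are dropped in this sketch; a real
    # engine handles them with watermarks and allowed lateness.
```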
Benefits and trade-offs
Stream processing provides extremely low latency. This gives organizations the event-driven capabilities required by operational systems. The pattern enables automated, programmatic actions that are impossible with a batch-oriented design.
Streaming data pipelines are significantly more complex than batch pipelines. Designing and maintaining a fault-tolerant, stateful streaming data pipeline presents a major engineering challenge. State management, in particular, requires careful design to ensure correct data processing during a system failure.
Engineers must also solve complex problems in distributed systems. Out-of-order data, where events arrive late due to network latency, will corrupt time-windowed calculations if not handled correctly. Guaranteeing exactly-once processing semantics requires complex coordination to ensure each event is processed once, even with network failures and retries.
This operational overhead makes stream processing a poor fit for the flexible, large-scale historical analysis at which batch ELT excels.
Pattern 3: The change data capture pipeline
Traditional database replication relies on inefficient, high-impact batch queries. These queries repeatedly scan production tables for new or updated rows, consuming CPU and I/O resources. This approach often misses deleted records and can generate inaccurate data in the destination.
Change Data Capture (CDC) is a modern architectural pattern that solves these problems. A CDC data pipeline reads the internal transaction log of a source database to capture every row-level change as a structured stream of events. This pattern provides an efficient, low-impact, and near real-time method for database replication.
Primary use case: Efficient database replication
CDC is the superior architectural pattern for keeping a destination system, such as a cloud data warehouse, in continuous sync with a production OLTP database. This supports operational analytics that require data freshness measured in minutes, not hours. The goal is to provide a near real-time view of the business state.
Common applications include:
- Live inventory dashboards: An e-commerce company uses CDC to replicate its inventory database to a data warehouse. This powers a live dashboard that business users consult to monitor stock levels and make immediate pricing decisions.
- Customer support applications: A support team uses a dashboard that reflects the most recent customer orders. CDC provides the up-to-the-minute data from the production orders database that this dashboard requires.
- Microservice data synchronization: In a microservices architecture, CDC pipelines sync data between the dedicated databases of different services. This maintains data consistency across the operational landscape without creating tight coupling between the services.
Architectural flow
The CDC process uses a non-intrusive, log-based architecture. It treats the database's internal transaction log as the guaranteed, ordered source of truth for every change.
- The transaction log
All ACID-compliant databases use a transaction log to ensure data durability and for crash recovery. When a transaction commits a change, the database first writes a record of that change to its log before updating the data tables. The log contains the complete sequence of every successful modification in the database.
Databases use different names for this component. PostgreSQL uses the write-ahead log (WAL), MySQL uses the binary log (binlog), and Oracle uses the redo log. The database itself relies on this log for its core operations, making it the most reliable source of information about changes.
- The log reader process
A specialized log reader process connects to the source database with specific replication permissions. This reader continuously tails the transaction log and parses the low-level, often binary, log entries. It translates these entries into a structured stream of change events, with each event representing a single INSERT, UPDATE, or DELETE.
The process does not execute queries against the production tables. It simply reads a file that the database already produces for its own internal purposes. This non-intrusive design avoids I/O contention and performance degradation on the source system.
- Delivery and merging
The CDC tool transmits the stream of change events to a destination system, such as a cloud data warehouse. The events arrive in the same order as they were committed in the source database. An automated process at the destination consumes this stream and replays the changes into a target table.
This process typically uses a MERGE or UPSERT command. These commands efficiently apply the inserts, updates, and deletes to the target table to maintain transactional consistency with the source.
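The sketch below shows the merge step in miniature: it replays a small stream of change events, in commit order, into a SQLite table so the target stays an exact mirror of the source. The event format (op, id, row values) is an assumption modeled on what log-based CDC tools typically emit; a warehouse applies the same logic with a MERGE statement.

```python
import sqlite3

# Change events in source commit order, as a log-based CDC reader might emit them.
change_events = [
    {"op": "insert", "id": 1, "row": {"status": "pending", "amount": 20.0}},
    {"op": "update", "id": 1, "row": {"status": "shipped", "amount": 20.0}},
    {"op": "delete", "id": 1, "row": None},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, amount REAL)")

def apply_change(conn: sqlite3.Connection, event: dict) -> None:
    """Replay one change event so the target table stays an exact mirror of the source."""
    if event["op"] == "delete":
        # Deletes are captured too, which timestamp-based batch sync would miss.
        conn.execute("DELETE FROM orders WHERE id = ?", (event["id"],))
    else:
        # Inserts and updates both become an upsert keyed on the primary key.
        conn.execute(
            """
            INSERT INTO orders (id, status, amount)
            VALUES (:id, :status, :amount)
            ON CONFLICT(id) DO UPDATE SET status = excluded.status, amount = excluded.amount
            """,
            {"id": event["id"], **event["row"]},
        )

for event in change_events:  # order matters: replay changes as they were committed
    apply_change(conn, event)
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM orders").fetchone())  # (0,) -- the delete reached the target
```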
Benefits and trade-offs
CDC provides major advantages over traditional batch processing: minimal source impact, complete data fidelity, and low data latency.
- Minimal source impact: Batch queries consume production I/O and CPU cycles that the primary application needs. Log-based CDC avoids this performance degradation and has a negligible impact on the source.
- Complete data fidelity: Batch methods that use timestamps fail to capture DELETE operations. CDC captures every DELETE from the transaction log, so the destination remains an exact mirror of the source.
- Low latency: CDC data pipelines operate in near real time. Data arrives minutes after the source transaction, providing much fresher data than hourly batch processing.
CDC is complex to configure and maintain. Setup requires elevated database permissions for logical decoding, and engineers must manage log retention policies. A CDC data pipeline must also handle schema evolution, like adding new columns, to prevent replication failures.
Data pipeline architecture comparison
Choosing the correct pattern requires a clear understanding of the trade-offs between data latency, architectural complexity, and the primary use case. The following table provides a direct comparison of the 3 dominant patterns.
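| Pattern | Typical data latency | Primary use case | Operational complexity |
| --- | --- | --- | --- |
| Batch ELT | Hours to days (scheduled runs) | Business intelligence, reporting, and analytics | Lowest, especially with an automated data movement platform |
| Stream processing | Seconds to milliseconds | Real-time operational applications and automated actions | Highest; requires distributed-systems expertise |
| Change data capture (CDC) | Minutes (near real time) | Continuous database replication and operational analytics | Complex setup; lower ongoing maintenance than custom streaming |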
How to choose the right data pipeline architecture
Answer the following questions to select the correct architectural pattern for your use case.
- What is your data latency requirement?
For historical analysis that uses data hours or days old, use batch ELT. For near real-time data from a database with latency measured in minutes, use CDC. A system that must react in seconds or milliseconds requires stream processing.
- Is the primary use case analytical or operational?
Batch ELT is built for the flexible queries required by business intelligence and reporting. An operational use case that triggers an immediate, programmatic action, like blocking a transaction, requires stream processing.
- Are you replicating a production database?
To maintain an accurate copy of a transactional database in a data warehouse, use CDC. The pattern is more efficient and has a lower impact on the source system than any batch-based alternative.
- What is your team’s tolerance for operational complexity?
Batch ELT pipelines are the least complex to build and maintain, especially with an automated data movement platform. Stream processing architectures are the most complex and require specialized skills in distributed systems. CDC setup is complex, but its maintenance is often lower than a custom streaming application.
Build your pipeline on a strong foundation with Fivetran
The ideal data pipeline strikes a balance between latency, performance, and specific business needs.
Whichever of the 3 patterns you choose, it needs a reliable foundation.
Fivetran provides an automated data movement platform for modern ELT and CDC pipelines. The platform manages the complex and unreliable process of data extraction and loading. Engineering teams use Fivetran to focus on building high-value analytics and data products.