Building reliable data pipelines requires more than extracting data and writing transformation logic; it also demands production-grade reliability, performance, and data integrity. Those qualities, in turn, depend on managing state across failures, a challenge that traditionally means assembling and operating sophisticated infrastructure.
What is state management?
In the context of data pipelines, “state” is a snapshot recording the last known increment of progress, specifically what data has been read, transformed, or written. State management enables incremental data synchronization by tracking which records have already been processed. Without state, every sync would be a full historical reload, wasting resources and time.
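For instance, the state for a single-table connector can be as small as one key recording the last cursor seen (the key name here is illustrative):

```python
# Illustrative state snapshot: everything modified up to this timestamp
# has been synced, so the next run starts from here.
state = {"last_updated_at": "2025-01-15T10:30:00Z"}
```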
Depending on the constraints of the source system, state can be tracked using several methods.
Timestamp cursors track the latest timestamp seen (e.g., updated_at > '2025-01-15T10:30:00Z'). The next sync only queries records modified after that timestamp. However, timestamp-based cursors are complicated by potential inconsistencies such as timezones, precision (milliseconds vs. seconds), and commit time versus start time.
Sequence-based cursors track auto-incrementing IDs or log sequence numbers. These are more reliable than timestamps but may miss updates to existing records if the ID doesn't change. This method is best suited for append-only data.
Pagination tokens are provided by APIs (like Microsoft Graph or Salesforce) that abstract sync logic from consumers. After each sync, the API returns a token representing that moment's state. The next sync presents this token to receive only changes since it was issued.
Multi-table state becomes necessary when a connector syncs multiple tables or endpoints. A CRM connector tracking separate tables for accounts, opportunities, and contacts needs an independent state for each table, creating a complex state object that must be managed atomically.
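Such a state object might look like the following sketch, combining the cursor styles described above (all keys and values are illustrative):

```python
# Hypothetical multi-table state for a CRM connector. Each table tracks its
# own cursor type, and the whole object must be saved and restored atomically.
state = {
    "accounts_last_updated": "2025-01-15T10:30:00Z",  # timestamp cursor
    "opportunities_max_id": 48213,                    # sequence-based cursor
    "contacts_sync_token": "tok_8f3a91c2",            # API pagination token
}
```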
Why you shouldn’t DIY state management
When building custom data pipelines from scratch, developers own the entire state lifecycle. This requirement introduces multiple, compounding challenges.
Infrastructure overhead
State must be durable. It cannot live in process memory because it will be lost in a crash, forcing a full re-sync. This requires:
- Storage provisioning: Select and maintain a state storage system (S3, GCS, PostgreSQL, MySQL, Redis, DynamoDB). Each requires configuration, backups, monitoring, and scaling strategies. You also need state persistence logic: the ability to save and reload state, with error handling for missing state files.
- Security and access control: Implement authentication and authorization for state storage, complicating credential management.
- Serialization logic: Write boilerplate code to serialize complex, multi-table state objects into a storable format (e.g., JSON) and deserialize them reliably on retrieval (see the sketch after this list).
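A DIY pipeline accumulates boilerplate along these lines; the sketch below assumes S3 as the state store, with bucket and key names chosen purely for illustration:

```python
import json
import boto3

BUCKET, KEY = "my-pipeline-state", "crm/state.json"  # illustrative names
s3 = boto3.client("s3")

def load_state() -> dict:
    # Handle the first run, when no state file exists yet.
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=KEY)
        return json.loads(obj["Body"].read())
    except s3.exceptions.NoSuchKey:
        return {}

def save_state(state: dict) -> None:
    # Serialize the (possibly multi-table) state dict back to JSON.
    s3.put_object(Bucket=BUCKET, Key=KEY, Body=json.dumps(state).encode())
```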
All of this foundational infrastructure is necessary before any data extraction logic can be written.
Extraction and loading
Data extraction requires loops to fetch data from sources with cursor-based pagination. The pipeline needs to write records to memory buffers until batch size thresholds are met, while tracking the latest cursor values seen in each batch.
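In practice that means a loop like the sketch below, where fetch_page and flush are hypothetical stand-ins for a real source client and batch writer:

```python
def extract(state: dict):
    # Start from the last checkpointed cursor, with a fallback for first runs.
    since = state.get("last_updated_at", "1970-01-01T00:00:00Z")
    latest = since
    buffer, token = [], None
    while True:
        # fetch_page (hypothetical) returns one page of records modified
        # after `since`, plus a token for the next page.
        page = fetch_page(updated_after=since, page_token=token)
        for record in page.records:
            buffer.append(record)
            latest = max(latest, record["updated_at"])  # latest cursor seen
            if len(buffer) >= 10_000:                   # batch size threshold
                flush(buffer)                           # hypothetical batch writer
                buffer.clear()
        token = page.next_token
        if token is None:
            break
    if buffer:
        flush(buffer)
    state["last_updated_at"] = latest
```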
To load data, you will need to build destination-specific INSERT statements with conflict resolution (UPSERT logic) for each table's schema. You will need transaction management with proper commit and rollback logic, and you will have to manage database cursors and the connection lifecycle yourself.
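For a PostgreSQL destination, that UPSERT logic typically leans on ON CONFLICT, sketched here with psycopg2 and an illustrative users table:

```python
import psycopg2

UPSERT_SQL = """
    INSERT INTO users (id, email, updated_at)
    VALUES (%s, %s, %s)
    ON CONFLICT (id) DO UPDATE
    SET email = EXCLUDED.email, updated_at = EXCLUDED.updated_at;
"""

def load_batch(rows):
    conn = psycopg2.connect("dbname=warehouse")  # illustrative DSN
    try:
        # The connection context commits on success and rolls back on error;
        # the cursor context handles cursor cleanup.
        with conn, conn.cursor() as cur:
            cur.executemany(UPSERT_SQL, rows)
    finally:
        conn.close()
```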
There is also the matter of orchestration — coordinating the sequential synchronization of multiple tables to ensure that hierarchical dependencies are respected, as well as ensuring proper connection cleanup and error propagation.
Failure handling and recovery
Production pipelines face failures at every stage – API rate limits, network or query failures, infrastructure stoppages, credential issues, resource limitations (e.g., memory leaks), and bugs of all kinds. The pipeline must implement robust retry logic that's tightly coupled with state management.
If a pipeline fails partway through a large dataset, it must:
- Pause processing
- Persist the intermediate state using any relevant cursors or pagination tokens
- Retry
- Resume within a reasonable distance of where it left off
Idempotence is key to handling failure and recovery: the pipeline must be able to match incoming records against records already in the destination, typically by primary key. Without idempotence, retry mechanisms lead to either costly full resyncs or duplicate rows.
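A minimal retry wrapper, reusing the load_state, save_state, and extract helpers sketched above, might look like:

```python
import time

def run_with_retries(max_attempts: int = 3):
    state = load_state()  # resume from the last persisted state
    for attempt in range(1, max_attempts + 1):
        try:
            extract(state)     # picks up from cursors already in `state`
            save_state(state)  # persist progress for the next run
            return
        except Exception:
            save_state(state)  # keep whatever intermediate cursors were reached
            if attempt == max_attempts:
                raise
            time.sleep(2 ** attempt)  # exponential backoff before retrying
```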
Atomicity coordination
Atomicity consists of treating a batch of operations as a discrete, indivisible unit, with either all or none of the operations being completed. To guarantee atomicity in the face of potential failures, the state must only be updated after the data in the corresponding batch has been successfully committed to the destination warehouse.
Suppose we’re in the midst of syncing batch #5 of many.
Scenario 1: Premature checkpointing causes data loss
- Pipeline processes batch #5
- Pipeline updates state to reflect batch #5 is complete
- Pipeline attempts to write batch #5 to destination
- Write fails (connectivity issue, schema violation)
- Next sync reads state, assumes batch #5 succeeded, starts at batch #6
- Some or all records from batch #5 are permanently lost
However, updating the state only after the corresponding data has been committed introduces its own failure mode.
Scenario 2: Delayed or failed checkpointing causes duplication
- Pipeline processes batch #5
- Pipeline successfully writes batch #5 to destination
- Pipeline attempts to update state
- State update fails (script crashes, S3 connection lost)
- Next sync reads state showing batch #4 as last successful point
- Batch #5 is reprocessed, duplicating some or all batch #5 records
Scenario 2 is why idempotence is non-negotiable: the pipeline must recognize records that were already written and avoid duplicating them, as in the sketch below.
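Put together, the only safe ordering is write first, checkpoint second, with idempotent upserts absorbing any reprocessing; a sketch using the earlier hypothetical helpers:

```python
def sync_batches(batches, state: dict):
    for n, batch in enumerate(batches, start=1):
        load_batch(batch)           # 1. commit the data to the destination
        state["last_batch"] = n     # 2. only then advance the cursor
        save_state(state)           # 3. persist the checkpoint
        # If save_state fails here, the next run re-reads the old state and
        # reprocesses this batch; idempotent upserts absorb the duplicates.
```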
Concurrency and locking
If a job runs too long, the next might start before the first completes. Multiple instances of your pipeline can run simultaneously, especially under schedulers like Airflow or Kubernetes, causing them to:
- Read the same initial state
- Process the same data
- Attempt conflicting state updates
- Create data duplication and state corruption
While it is possible to address concurrency with distributed locking using tools like Zookeeper or Redis, it is usually better to avoid concurrency altogether.
Debugging
State management issues may cause failed syncs or incorrect data. Diagnosing and solving these issues requires the following steps:
- Inspect production storage: Manually query the PostgreSQL table or download the JSON blob from S3.
- Analyze historical logs: Sift through verbose application logs to trace state transitions.
- Cross-reference systems: Compare timestamps and sequences across source API logs, pipeline logs, state store, and the destination warehouse.
This sleuthing is so troublesome that developers often resort to "state surgery," manually editing the production state object. This is risky; miscalculating the rewind point can result in additional data loss or corruption.
Fivetran Connector SDK makes state management trivial
The Fivetran Connector SDK handles state management with a simple Python dictionary passed directly to your connector function. The connector receives two parameters: a configuration dictionary containing secrets and credentials configured during deployment, and a state dictionary that is empty on the first sync and contains the most recently checkpointed state on subsequent runs.
Developers initialize cursors with fallback values for first-run scenarios, query the source, send records to Fivetran using upsert operations, update the cursor as records are processed, and finally checkpoint the state. Fivetran ensures atomicity of this checkpoint operation.
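A minimal connector following that pattern might look like the sketch below; the table, column names, and fetch_users helper are illustrative, not a prescribed API:

```python
from fivetran_connector_sdk import Connector
from fivetran_connector_sdk import Operations as op

def schema(configuration: dict):
    # Declare an explicit primary key so Fivetran can upsert idempotently.
    return [{"table": "users", "primary_key": ["id"]}]

def update(configuration: dict, state: dict):
    # Fall back to a fixed start date on the very first sync.
    cursor = state.get("users_last_updated", "1970-01-01T00:00:00Z")
    for row in fetch_users(updated_after=cursor):  # hypothetical source call
        yield op.upsert(table="users", data=row)
        cursor = max(cursor, row["updated_at"])
    state["users_last_updated"] = cursor
    yield op.checkpoint(state)  # Fivetran persists data and state together

connector = Connector(update=update, schema=schema)
```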
Multi-table state management
For connectors syncing multiple tables, developers can maintain a state dictionary with independent cursors for each table, such as timestamp cursors for users, sequence ID cursors for orders, and timestamp cursors for products. Each table is synced independently, updating its respective cursor, and all cursors are checkpointed together atomically at the end of the sync.
The state structure is flexible and developer-controlled. Use a single timestamp, multiple cursors per table, or nested structures for hierarchical entities, depending on your data source's requirements.
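Inside an SDK update function (reusing the imports from the earlier sketch), that can be as simple as syncing each table in turn and checkpointing the combined dictionary once; fetch_users and fetch_orders are hypothetical source calls:

```python
def update(configuration: dict, state: dict):
    # Each table keeps its own independent cursor in the same state dict.
    for row in fetch_users(since=state.get("users_last_updated")):
        yield op.upsert(table="users", data=row)
        state["users_last_updated"] = row["updated_at"]

    for row in fetch_orders(after_id=state.get("orders_max_id", 0)):
        yield op.upsert(table="orders", data=row)
        state["orders_max_id"] = row["id"]

    # One checkpoint covers every table's cursor atomically.
    yield op.checkpoint(state)
```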
Connector SDK makes the power of the Fivetran platform accessible
The Connector SDK is supported by the core architecture of the Fivetran platform, offering not only code simplicity but also enterprise-grade operational benefits.
Atomic checkpointing with at-least-once delivery: When a checkpoint is called, Fivetran durably persists both the data sent since the last checkpoint and the new state together. This establishes a checkpoint boundary that guarantees at-least-once delivery. If a failure occurs during a sync (e.g., the connector crashes after sending data but before completing the checkpoint), the next sync resumes from the last successful checkpoint. This means that data may be reprocessed and resent. You must identify primary keys so Fivetran can perform idempotent upserts.
Automatic retry and recovery: If a sync fails, the next scheduled sync resumes from the last checkpoint without manual intervention. No need to implement complex retry logic.
Zero infrastructure management: No need to provision storage, manage credentials, or implement serialization. Fivetran handles state persistence in secure, managed storage.
Concurrency prevention: Fivetran's orchestration typically ensures only one sync runs per connector instance at a time. If a sync is already running when a scheduled sync should start, Fivetran postpones the new sync. This prevents the concurrent execution issues that plague DIY implementations.
Schedule management: Fivetran handles sync scheduling — no need to configure cron jobs or orchestration tools.
Monitoring and alerting: Built-in dashboards show sync status, data volume, errors, and state progression. You can set up alerts for failures.
Integrated observability: State may appear in logs and is retrievable through the REST API. You can inspect state history, correlate state changes with data loads, and debug issues using logs and the API.
Operational control: Use the Fivetran REST API to inspect and modify state when needed. No "state surgery" on production databases.
Local testing and debugging: The fivetran debug command runs connectors locally, automatically creating and managing a local state file in the project directory. Developers can inspect or manually edit this file to test different sync scenarios. The fivetran reset command clears local state to simulate a fresh sync.
The local state file uses standard JSON format with cursor values for each tracked entity, making it easy to understand and modify for testing purposes.
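After a debug run, for example, the local state file might contain nothing more than (values illustrative):

```json
{
  "users_last_updated": "2025-01-15T10:30:00Z",
  "orders_max_id": 48213
}
```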
Scalability: Fivetran automatically provisions and scales compute resources. Your connector code runs in Fivetran's managed environment without infrastructure concerns.
Security and compliance: Fivetran infrastructure is SOC 2 Type II compliant, with encryption at rest and in transit for data. Configuration values (credentials) are encrypted. Remember that state itself is operational metadata that may appear in logs, so sensitive data should always go in configuration.

The upshot is that Fivetran radically reduces not only code volume but also the entire operational burden. DIY implementations require dedicated infrastructure engineering, ongoing maintenance, and in-depth expertise in distributed systems.
Considerations for state management
To get the most out of the Connector SDK, observe the following habits:
Use descriptive state keys: Instead of generic keys like "cursor" or "timestamp", use descriptive names that clearly indicate what entity and aspect they track, such as "users_last_updated", "orders_max_id", or "products_sync_token".
Explicitly identify primary keys where possible: If you omit primary keys in your schema definition, Fivetran hashes all columns to create _fivetran_id. When new columns appear, this hash changes and can result in duplicate entries. Always define explicit primary keys for idempotent upserts and reliable data integrity.
Checkpoint regularly for long-running syncs: For syncs processing millions of records, checkpoint periodically (approximately every 10 minutes or 10,000 records) to minimize recovery time if a sync fails. Avoid excessive checkpointing (more frequently than once per minute), as this can introduce performance overhead. Implement counter-based logic that triggers checkpoints at regular intervals during large dataset processing.
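The counter-based pattern is straightforward (reusing the SDK imports from the earlier sketch; the threshold, table name, and fetch_events are illustrative):

```python
def update(configuration: dict, state: dict):
    cursor = state.get("events_last_updated", "1970-01-01T00:00:00Z")
    for count, row in enumerate(fetch_events(updated_after=cursor), start=1):
        yield op.upsert(table="events", data=row)
        cursor = max(cursor, row["updated_at"])
        if count % 10_000 == 0:  # periodic checkpoint during large syncs
            state["events_last_updated"] = cursor
            yield op.checkpoint(state)
    state["events_last_updated"] = cursor
    yield op.checkpoint(state)  # final checkpoint at end of sync
```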
Understand configuration vs. state security:
- Configuration: Encrypted and secure. Use for secrets, API keys, credentials.
- State: Treated as operational metadata, NOT encrypted, may appear in logs. Use for sync progress, cursors, timestamps.
Never mix these purposes. Always use configuration for sensitive data and state only for tracking sync progress.
Keep state size under 10MB: State has a 10MB limit. If you need to track large amounts of data, store it in your destination and reference it by ID. Instead of storing comprehensive lists of processed identifiers, track only the latest cursor position.
Focus on the last mile
State management is critical for reliable data pipelines, but it's undifferentiated infrastructure work. Building it from scratch is a substantial engineering effort requiring deep distributed systems expertise, ongoing maintenance and operational overhead, and the risk of subtle bugs that threaten data integrity.
By contrast, the Fivetran Connector SDK provides battle-tested state management as a platform service. Developers write extraction logic instead of provisioning and managing infrastructure. For teams building custom data connectors, the choice is clear: invest engineering effort in data extraction logic that differentiates your pipeline, not in rehashing solved problems.
Getting started
Ready to build robust custom connectors without all of the infrastructure complexity?
- Explore the documentation: Visit the Fivetran Connector SDK documentation
- Review examples: Browse the GitHub repository with 60+ working examples
- Learn best practices: Read the Best Practices guide
- Review the technical reference: Check the Technical Reference for detailed API documentation
- Start building: Follow the Setup Guide to create your first connector
- Get help: File a support ticket for free professional services assistance with your first connector
The SDK pairs well with AI coding assistants like Claude, Cursor, and GitHub Copilot, enabling the rapid development of connectors. Focus on your data and let Fivetran handle the rest.
[CTA_MODULE]


