How to load data from S3 to Redshift

A modern data stack often depends on 2 core Amazon Web Services: Amazon S3 for its scalable object storage and Amazon Redshift for its massively parallel processing (MPP) data warehouse.
Transferring data between them is where the challenge lies.
A production-grade pipeline must do more than a one-time bulk load. It has to handle incremental updates to keep data fresh and manage schema drift without failing.
This article provides a technical comparison of the 3 primary methods for building that pipeline:
- Using the native Redshift COPY command
- Developing a process with AWS Glue
- Deploying an automated ELT platform
Each approach comes with distinct trade-offs in setup, maintenance, and scalability.
Method 1: The native COPY command
Redshift’s primary tool for bulk data ingestion from Amazon S3 is the COPY command. It uses the data warehouse’s MPP architecture to load data at high speed.
When an engineer executes the command, the leader node of the cluster divides the workload among all compute nodes, which then pull data from S3 in parallel.
The command provides direct control over the loading process and supports multiple data formats, including CSV, JSON, Avro, and Parquet.
It also includes options for managing file compression and mapping source columns to different target columns. An IAM role attached to the Redshift cluster grants secure, temporary credentials for the connection to S3.
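As a minimal sketch of what an engineer executes, the snippet below issues a COPY for Parquet files under an S3 prefix. The cluster endpoint, credentials, bucket, table, and IAM role ARN are all placeholders, and it assumes the redshift_connector Python driver.
```python
import redshift_connector

# All connection details below are placeholders for an existing cluster and user.
conn = redshift_connector.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="etl_user",
    password="<password>",
)
conn.autocommit = True

# The leader node splits this load across compute nodes, which read S3 in parallel.
copy_sql = """
    COPY public.orders
    FROM 's3://example-bucket/orders/2024-06-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

cursor = conn.cursor()
cursor.execute(copy_sql)
conn.close()
```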
COPY command method overview
Using the command effectively in production requires managing its core operational mechanics.
- Targeted loading with manifest files: The COPY command uses manifest files to load files selectively. A manifest is a JSON-formatted file that contains the explicit S3 object path for every file to be loaded. The command cannot generate this file, so the engineer must build a separate, external process, such as a Lambda function or an Airflow DAG, to scan S3, identify new files, and produce the manifest for each batch (a sketch of such a process follows this list).
- Schema-on-write enforcement: The command operates on a strict schema-on-write model. The structure of the data in Amazon S3 must exactly match the column order and data types of the target Redshift table. Any deviation, such as a new column in the source file, causes the command to fail, and the data transfer halts until an engineer manually alters the table schema and re-executes the load.
- System tables for error logging: An engineer must query the STL_LOAD_ERRORS system table to debug ingestion failures. While the command’s MAXERROR parameter allows the process to complete by skipping a specified number of rows containing errors, this creates a risk of loading incomplete data.
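As a rough sketch of that external file-discovery process (the bucket names, prefix, and last-run cutoff are all hypothetical), a boto3 script or Lambda handler could build the manifest like this:
```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "example-source-bucket"      # placeholder: where raw files land
SOURCE_PREFIX = "orders/"
MANIFEST_BUCKET = "example-manifest-bucket"  # placeholder: where manifests are written


def build_manifest(last_run: datetime) -> str:
    """List files newer than the last run and write a Redshift COPY manifest."""
    entries = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > last_run:  # last_run must be timezone-aware
                entries.append(
                    {"url": f"s3://{SOURCE_BUCKET}/{obj['Key']}", "mandatory": True}
                )

    manifest_key = f"manifests/orders-{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.manifest"
    s3.put_object(
        Bucket=MANIFEST_BUCKET,
        Key=manifest_key,
        Body=json.dumps({"entries": entries}),
    )
    # The COPY command would then reference this path with the MANIFEST option.
    return f"s3://{MANIFEST_BUCKET}/{manifest_key}"
```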
For well-defined, infrequent bulk loads, the COPY command is a high-performance solution.
Its primary trade-off is the significant engineering overhead required to build and maintain the surrounding automation for file discovery, schema management, and monitoring.
Method 2: AWS Glue for custom ETL
When a pipeline requires in-flight data transformation, engineers turn to AWS Glue. Glue is a serverless platform designed to run extract, transform, and load (ETL) jobs.
An engineer can write a custom script that runs on a managed Apache Spark environment to clean, reshape, or enrich data as it moves between Amazon S3 and Redshift.
The service coordinates 3 components to execute this process. First, a crawler scans the source data in S3 to infer a schema. It registers that schema as a table in the AWS Glue Data Catalog, a central metadata repository. The ETL job then uses the catalog metadata to read the S3 data, apply the transformations defined in the script, and write the final result to Redshift.
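A skeletal version of such a job script might look like the following. The catalog database, table, Glue connection name, staging bucket, and the transformation itself are placeholders rather than a reference implementation.
```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the S3 data through the Data Catalog table the crawler created.
# transformation_ctx lets job bookmarks track which files were already processed.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_s3",           # placeholder catalog database
    table_name="orders",         # placeholder catalog table
    transformation_ctx="source",
)

# Example in-flight transformation: drop a field the warehouse does not need.
transformed = source.drop_fields(["internal_debug_field"])

# Write to Redshift through a pre-defined Glue connection, staging through S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=transformed,
    catalog_connection="redshift-connection",  # placeholder Glue connection name
    connection_options={"dbtable": "public.orders", "database": "analytics"},
    redshift_tmp_dir="s3://example-temp-bucket/glue-staging/",
)

job.commit()  # commit the bookmark so the next run skips these files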
AWS Glue method overview
Using AWS Glue in production means managing the interplay between these components.
- Automated schema discovery: The Glue crawler automates the initial work of defining a table structure. If the source data schema changes, the crawler must be re-run to update the Data Catalog, and the ETL script itself may then also require manual updates to handle the new or altered fields (a small re-run sketch follows this list).
- Custom transformation with Spark: Glue’s primary purpose is to run custom ETL code, typically written in PySpark. This allows for complex business logic, data validation, and enrichment before the data is loaded, but engineers are responsible for developing, testing, and maintaining this code as business requirements change.
- Incremental processing with job bookmarks: To automate incremental loads, Glue uses a feature called job bookmarks. When enabled, a bookmark records a timestamp after each successful run. On the next run, the job automatically processes only the files that have appeared in S3 since the last timestamp, which eliminates the need to build a manual file-tracking system.
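When the source schema does change, re-running the crawler can itself be scripted. A minimal boto3 sketch, assuming a crawler named s3-orders-crawler already exists:
```python
import time

import boto3

glue = boto3.client("glue")
CRAWLER_NAME = "s3-orders-crawler"  # hypothetical crawler name

# Kick off a new crawl so the Data Catalog picks up added or altered columns.
glue.start_crawler(Name=CRAWLER_NAME)

# Wait for the crawl to finish before re-running the dependent ETL job.
while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
    time.sleep(30)
```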
AWS Glue offers fine-grained control over data transformation, but it is a significant step up in complexity from the COPY command: it trades that command’s simplicity for control, at the cost of higher development and maintenance overhead.
Method 3: Automated ELT
The third method shifts the pipeline model from ETL to ELT (extract, load, transform).
This abstracts the underlying code and infrastructure entirely. Fully managed ELT platforms, like Fivetran, have pre-built S3 connectors that combine the extract and load stages into a single, automated process. All data transformation then occurs on top of the raw data after it lands in Redshift.
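Under this model, the engineer’s code is SQL that runs inside Redshift after the load. The sketch below is illustrative only: the connection details, schemas, and table names are hypothetical, and the raw table is assumed to have been landed by the connector.
```python
import redshift_connector

# Placeholder connection details for the warehouse.
conn = redshift_connector.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="analytics_user",
    password="<password>",
)
conn.autocommit = True

# Transform after load: model the raw, connector-landed table into a reporting table.
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE reporting.daily_orders AS
    SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM raw_s3.orders
    GROUP BY order_date;
""")
conn.close()
```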
Automated ELT method overview
This approach changes the operational contract. The engineer’s responsibility shifts from writing and maintaining pipeline code to configuring the connection and defining business logic in SQL.
- Continuous file detection and processing: Instead of scheduled batch jobs, these platforms monitor S3 locations continuously. When a new or modified file appears, the service automatically validates, ingests, and commits the data to Redshift. The process tracks file state internally, removing the need for engineer-managed manifest files or job bookmarks.
- Automated schema migration: The architecture of these platforms anticipates schema drift, treating it as a routine event. When a new column appears in a source file, Fivetran automatically issues an ALTER TABLE command to add the corresponding column in Redshift before loading the new data. Data type changes are handled by casting the target column to a compatible supertype, allowing the transfer to continue without interruption, while destructive or incompatible changes may require manual review.
- Managed execution and recovery: The service provider owns responsibility for the pipeline's execution. If a file is corrupted or a network error interrupts a load, the platform's internal logic manages the retry attempts. The engineer is alerted to persistent issues, but the system handles the recovery process automatically.
An automated ELT platform is designed to produce a reliable data pipeline with minimal operational overhead. This approach trades the granular control of custom scripting for a significant reduction in development and maintenance time.
Comparison of loading methods
The following table directly compares the 3 methods against the core engineering requirements for a production data pipeline.

| Requirement | COPY command | AWS Glue | Automated ELT platform |
| --- | --- | --- | --- |
| Incremental loading | Engineer-built manifest process (e.g., Lambda or Airflow) | Job bookmarks process only new files | Continuous file detection with internal state tracking |
| Schema drift | Load fails until the table is manually altered and the load re-run | Crawler re-run plus possible ETL script updates | Automatic ALTER TABLE and type casting; incompatible changes flagged for review |
| In-flight transformation | None; data loads as-is | Custom PySpark logic before loading | SQL on raw data after loading |
| Operational overhead | High: build and maintain the surrounding automation | High: develop, test, and maintain ETL code | Low: connector configuration and SQL modeling |
Common challenges: S3 to Amazon Redshift
A one-time data dump from S3 to Redshift is simple enough. Building that process into a reliable, production-level pipeline requires solving 3 recurring engineering problems to maintain performance and data integrity.
Incremental loading
Analytics requires fresh data, which makes reloading an entire Amazon S3 bucket for every update both inefficient and expensive. Production pipelines solve this with incremental loading, a process that identifies and ingests only new or modified files. Without an automated way to track these changes, compute costs escalate, and query performance suffers.
Schema drift
Source data schemas evolve. An application update might add a new column or change a data type, causing schema drift.
This breaks any pipeline hard-coded to a specific file structure and halts all data transfer. A production system has to anticipate this, either by detecting the changes and alerting an engineer, or by automatically propagating the new schema to the target Amazon Redshift table.
Error handling and data integrity
Pipelines sometimes fail. Source files may arrive corrupted, credentials can expire, and network issues can interrupt data transfers. Simple scripts cannot handle these exceptions and risk losing data.
A production-ready pipeline accounts for this with built-in error handling, including detailed logging, alerts, and automated retries to protect data integrity in the warehouse.
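For a COPY-based pipeline, that logging usually starts with the STL_LOAD_ERRORS system table mentioned earlier. A small sketch, with the same placeholder connection details as the earlier examples:
```python
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="etl_user",
    password="<password>",
)

# Pull the most recent load failures: which file, which line, and why it was rejected.
cursor = conn.cursor()
cursor.execute("""
    SELECT starttime, filename, line_number, colname, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 20;
""")
for row in cursor.fetchall():
    print(row)
conn.close()
```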
Handling data formats and structure
The format of your source data in S3 has a major impact on your choice of ingestion method. While a structured CSV file is simple to load, semi-structured formats like JSON or columnar formats like Parquet require specific architectural solutions.
- COPY command: The command handles semi-structured data in 2 ways: you can either load an entire JSON object into a single SUPER data type column for querying inside Redshift, or use a JSONPath file to manually shred specific JSON keys into a flat, relational table. The latter is a manual process that breaks whenever the source JSON structure changes (both options are sketched after this list).
- AWS Glue: The service is designed for this type of complexity. The Glue crawler automatically infers the schema from nested JSON or Parquet files and registers it in the Glue Data Catalog. An engineer can then use this reliable schema to write a Spark ETL job that flattens the data into the precise tabular format required by Redshift.
- Automated ELT platforms: These platforms handle semi-structured data automatically. The service parses JSON during ingestion, shredding nested objects and arrays into separate, relational tables in Redshift. It adds foreign keys to maintain relationships, removing the need for JSONPath files or custom ETL scripts.
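To make the 2 COPY options above concrete, the sketch below shows both a SUPER-column load and a JSONPaths-based load. The buckets, tables, role ARN, and JSONPaths file are placeholders, and the raw_events table is assumed to have a single SUPER column.
```python
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="etl_user",
    password="<password>",
)
conn.autocommit = True
cursor = conn.cursor()

# Option 1: land each JSON document intact in a single SUPER column.
cursor.execute("""
    COPY public.raw_events
    FROM 's3://example-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT JSON 'noshred';
""")

# Option 2: shred selected keys into relational columns with a JSONPaths file,
# e.g. a file containing {"jsonpaths": ["$.event_id", "$.user.id", "$.timestamp"]}.
cursor.execute("""
    COPY public.events
    FROM 's3://example-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    JSON 's3://example-bucket/config/events_jsonpaths.json';
""")

conn.close()
```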
Automate S3 to Redshift with Fivetran
The right method for moving data from S3 to Redshift depends on the engineering and resource trade-offs your team is willing to make.
- For infrequent bulk loads: The COPY command offers direct control but requires a significant investment in building and maintaining automation.
- For pipelines that need complex, in-flight transformation: AWS Glue provides a managed Spark environment at the cost of developing and managing custom ETL code.
- When the priority is pipeline reliability with minimal operational overhead: An automated ELT platform is the most efficient approach, allowing engineers to focus on building data models instead of managing infrastructure.
Fivetran provides a fully automated S3 connector to move data into Redshift.