How to load data from S3 to Redshift

A modern data stack often depends on 2 core Amazon Web Services: Amazon S3 for its scalable object storage and Amazon Redshift for its massively parallel processing (MPP) data warehouse.
Transferring data between them is where the challenge lies.
A production-grade pipeline must do more than a one-time bulk load. It has to handle incremental updates to keep data fresh and manage schema drift without failing.
This article provides a technical comparison of the 3 primary methods for building that pipeline:
- Using the native Redshift COPY command
- Developing a process with AWS Glue
- Deploying an automated ELT platform
Each approach comes with distinct trade-offs in setup, maintenance, and scalability.
Method 1: The native COPY command
Redshift’s primary tool for bulk data ingestion from Amazon S3 is the COPY command. It uses the data warehouse’s MPP architecture to load data at high speed.
When an engineer executes the command, the leader node of the cluster divides the workload among all compute nodes, which then pull data from S3 in parallel.
The command provides direct control over the loading process and supports multiple data formats, including CSV, JSON, Avro, and Parquet.
It also includes options for managing file compression and mapping source columns to different target columns. An IAM role attached to the Redshift cluster grants secure, temporary credentials for the connection to S3.
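As a minimal sketch of what an engineer executes, the snippet below issues a COPY for Parquet files under an S3 prefix. The cluster endpoint, credentials, bucket, table, and IAM role ARN are all placeholders, and it assumes the redshift_connector Python driver.
```python
import redshift_connector

# All connection details below are placeholders for an existing cluster and user.
conn = redshift_connector.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="etl_user",
    password="<password>",
)
conn.autocommit = True

# The leader node splits this load across compute nodes, which read S3 in parallel.
copy_sql = """
    COPY public.orders
    FROM 's3://example-bucket/orders/2024-06-01/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
"""

cursor = conn.cursor()
cursor.execute(copy_sql)
conn.close()
```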
COPY command method overview
Using the command effectively in production requires managing its core operational mechanics.
- Targeted loading with manifest files: The COPY command uses manifest files to load files selectively. A manifest is a JSON-formatted file that contains the explicit S3 object path for every file to be loaded. The command cannot generate this file, so the engineer must build a separate, external process, such as a Lambda function or an Airflow DAG, to scan S3, identify new files, and produce the manifest for each batch (a sketch of such a process follows this list).
- Schema-on-write enforcement: The command operates on a strict schema-on-write model. The structure of the data in Amazon S3 must exactly match the column order and data types of the target Redshift table. Any deviation, such as a new column in the source file, causes the command to fail, and the data transfer halts until an engineer manually alters the table schema and re-executes the load.
- System tables for error logging: An engineer must query the STL_LOAD_ERRORS system table to debug ingestion failures. While the command’s MAXERROR parameter allows the process to complete by skipping a specified number of rows containing errors, this creates a risk of loading incomplete data.
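As a rough sketch of that external file-discovery process (the bucket names, prefix, and last-run cutoff are all hypothetical), a boto3 script or Lambda handler could build the manifest like this:
```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "example-source-bucket"      # placeholder: where raw files land
SOURCE_PREFIX = "orders/"
MANIFEST_BUCKET = "example-manifest-bucket"  # placeholder: where manifests are written


def build_manifest(last_run: datetime) -> str:
    """List files newer than the last run and write a Redshift COPY manifest."""
    entries = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            if obj["LastModified"] > last_run:  # last_run must be timezone-aware
                entries.append(
                    {"url": f"s3://{SOURCE_BUCKET}/{obj['Key']}", "mandatory": True}
                )

    manifest_key = f"manifests/orders-{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.manifest"
    s3.put_object(
        Bucket=MANIFEST_BUCKET,
        Key=manifest_key,
        Body=json.dumps({"entries": entries}),
    )
    # The COPY command would then reference this path with the MANIFEST option.
    return f"s3://{MANIFEST_BUCKET}/{manifest_key}"
```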
For well-defined, infrequent bulk loads, the COPY command is a high-performance solution.
Its primary trade-off is the significant engineering overhead required to build and maintain the surrounding automation for file discovery, schema management, and monitoring.
Method 2: AWS Glue for custom ETL
When a pipeline requires in-flight data transformation, engineers turn to AWS Glue. Glue is a serverless platform designed to run extract, transform, and load (ETL) jobs.
An engineer can write a custom script that runs on a managed Apache Spark environment to clean, reshape, or enrich data as it moves between Amazon S3 and Redshift.
The service coordinates 3 components to execute this process. First, a crawler scans the source data in S3 to infer a schema. It registers that schema as a table in the AWS Glue Data Catalog, a central metadata repository. The ETL job then uses the catalog metadata to read the S3 data, apply the transformations defined in the script, and write the final result to Redshift.
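A skeletal version of such a job script might look like the following. The catalog database, table, Glue connection name, staging bucket, and the transformation itself are placeholders rather than a reference implementation.
```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the S3 data through the Data Catalog table the crawler created.
# transformation_ctx lets job bookmarks track which files were already processed.
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_s3",           # placeholder catalog database
    table_name="orders",         # placeholder catalog table
    transformation_ctx="source",
)

# Example in-flight transformation: drop a field the warehouse does not need.
transformed = source.drop_fields(["internal_debug_field"])

# Write to Redshift through a pre-defined Glue connection, staging through S3.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=transformed,
    catalog_connection="redshift-connection",  # placeholder Glue connection name
    connection_options={"dbtable": "public.orders", "database": "analytics"},
    redshift_tmp_dir="s3://example-temp-bucket/glue-staging/",
)

job.commit()  # commit the bookmark so the next run skips these files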
AWS Glue method overview
Using AWS Glue in production means managing the interplay between these components.
- Automated schema discovery: The Glue crawler automates the initial work of defining a table structure. If the source data schema changes, the crawler must be re-run to update the Data Catalog, and the ETL script itself may then also require manual updates to handle the new or altered fields (a small re-run sketch follows this list).
- Custom transformation with Spark: Glue’s primary purpose is to run custom ETL code, typically written in PySpark. This allows for complex business logic, data validation, and enrichment before the data is loaded, but engineers are responsible for developing, testing, and maintaining this code as business requirements change.
- Incremental processing with job bookmarks: To automate incremental loads, Glue uses a feature called job bookmarks. When enabled, a bookmark records a timestamp after each successful run. On the next run, the job automatically processes only the files that have appeared in S3 since the last timestamp, which eliminates the need to build a manual file-tracking system.
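When the source schema does change, re-running the crawler can itself be scripted. A minimal boto3 sketch, assuming a crawler named s3-orders-crawler already exists:
```python
import time

import boto3

glue = boto3.client("glue")
CRAWLER_NAME = "s3-orders-crawler"  # hypothetical crawler name

# Kick off a new crawl so the Data Catalog picks up added or altered columns.
glue.start_crawler(Name=CRAWLER_NAME)

# Wait for the crawl to finish before re-running the dependent ETL job.
while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
    time.sleep(30)
```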
AWS Glue offers fine-grained control over data transformation, but it is a significant step up in complexity from the COPY command: it trades that command’s simplicity for control, at the cost of higher development and maintenance overhead.
Method 3: Automated ELT
The third method shifts the pipeline model from ETL to ELT (extract, load, transform).
This abstracts the underlying code and infrastructure entirely. Fully managed ELT platforms, like Fivetran, have pre-built S3 connectors that combine the extract and load stages into a single, automated process. All data transformation then occurs on top of the raw data after it lands in Redshift.
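Under this model, the engineer’s code is SQL that runs inside Redshift after the load. The sketch below is illustrative only: the connection details, schemas, and table names are hypothetical, and the raw table is assumed to have been landed by the connector.
```python
import redshift_connector

# Placeholder connection details for the warehouse.
conn = redshift_connector.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="analytics_user",
    password="<password>",
)
conn.autocommit = True

# Transform after load: model the raw, connector-landed table into a reporting table.
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE reporting.daily_orders AS
    SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
    FROM raw_s3.orders
    GROUP BY order_date;
""")
conn.close()
```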
Automated ELT method overview
This approach changes the operational contract. The engineer’s responsibility shifts from writing and maintaining pipeline code to configuring the connection and defining business logic in SQL.
- Continuous file detection and processing: Instead of scheduled batch jobs, these platforms monitor S3 locations continuously. When a new or modified file appears, the service automatically validates, ingests, and commits the data to Redshift. The process tracks file state internally, removing the need for engineer-managed manifest files or job bookmarks.
- Automated schema migration: The architecture of these platforms anticipates schema drift, treating it as a routine event. When a new column appears in a source file, Fivetran automatically issues an ALTER TABLE command to add the corresponding column in Redshift before loading the new data. Data type changes are handled by casting the target column to a compatible supertype, allowing the transfer to continue without interruption, while destructive or incompatible changes may require manual review.
- Managed execution and recovery: The service provider owns responsibility for the pipeline's execution. If a file is corrupted or a network error interrupts a load, the platform's internal logic manages the retry attempts. The engineer is alerted to persistent issues, but the system handles the recovery process automatically.
An automated ELT platform is designed to produce a reliable data pipeline with minimal operational overhead. This approach trades the granular control of custom scripting for a significant reduction in development and maintenance time.
Comparison of loading methods
The following table directly compares the 3 methods against the core engineering requirements for a production data pipeline.

| Requirement | COPY command | AWS Glue | Automated ELT platform |
| --- | --- | --- | --- |
| Incremental loading | Engineer-built manifest process (e.g., Lambda or Airflow) | Job bookmarks process only new files | Continuous file detection with internal state tracking |
| Schema drift | Load fails until the table is manually altered and the load re-run | Crawler re-run plus possible ETL script updates | Automatic ALTER TABLE and type casting; incompatible changes flagged for review |
| In-flight transformation | None; data loads as-is | Custom PySpark logic before loading | SQL on raw data after loading |
| Operational overhead | High: build and maintain the surrounding automation | High: develop, test, and maintain ETL code | Low: connector configuration and SQL modeling |
Common challenges: S3 to Amazon Redshift
A one-time data dump from S3 to Redshift is simple enough. Building that process into a reliable, production-level pipeline requires solving 3 recurring engineering problems to maintain performance and data integrity.
Incremental loading
Analytics requires fresh data, which makes reloading an entire Amazon S3 bucket for every update both inefficient and expensive. Production pipelines solve this with incremental loading, a process that identifies and ingests only new or modified files. Without an automated way to track these changes, compute costs escalate, and query performance suffers.
Schema drift
Source data schemas evolve. An application update might add a new column or change a data type, causing schema drift.
This breaks any pipeline hard-coded to a specific file structure and halts all data transfer. A production system has to anticipate this, either by detecting the changes and alerting an engineer, or by automatically propagating the new schema to the target Amazon Redshift table.
Error handling and data integrity
Pipelines sometimes fail. Source files may arrive corrupted, credentials can expire, and network issues can interrupt data transfers. Simple scripts cannot handle these exceptions and risk losing data.
A production-ready pipeline accounts for this with built-in error handling, including detailed logging, alerts, and automated retries to protect data integrity in the warehouse.
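For a COPY-based pipeline, that logging usually starts with the STL_LOAD_ERRORS system table mentioned earlier. A small sketch, with the same placeholder connection details as the earlier examples:
```python
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="etl_user",
    password="<password>",
)

# Pull the most recent load failures: which file, which line, and why it was rejected.
cursor = conn.cursor()
cursor.execute("""
    SELECT starttime, filename, line_number, colname, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 20;
""")
for row in cursor.fetchall():
    print(row)
conn.close()
```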
Handling data formats and structure
The format of your source data in S3 has a major impact on your choice of ingestion method. While a structured CSV file is simple to load, semi-structured formats like JSON or columnar formats like Parquet require specific architectural solutions.
- COPY command: The command handles semi-structured data in 2 ways: you can either load an entire JSON object into a single SUPER data type column for querying inside Redshift, or use a JSONPath file to manually shred specific JSON keys into a flat, relational table. The latter is a manual process that breaks whenever the source JSON structure changes (both options are sketched after this list).
- AWS Glue: The service is designed for this type of complexity. The Glue crawler automatically infers the schema from nested JSON or Parquet files and registers it in the Glue Data Catalog. An engineer can then use this reliable schema to write a Spark ETL job that flattens the data into the precise tabular format required by Redshift.
- Automated ELT platforms: These platforms handle semi-structured data automatically. The service parses JSON during ingestion, shredding nested objects and arrays into separate, relational tables in Redshift. It adds foreign keys to maintain relationships, removing the need for JSONPath files or custom ETL scripts.
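To make the 2 COPY options above concrete, the sketch below shows both a SUPER-column load and a JSONPaths-based load. The buckets, tables, role ARN, and JSONPaths file are placeholders, and the raw_events table is assumed to have a single SUPER column.
```python
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="analytics",
    user="etl_user",
    password="<password>",
)
conn.autocommit = True
cursor = conn.cursor()

# Option 1: land each JSON document intact in a single SUPER column.
cursor.execute("""
    COPY public.raw_events
    FROM 's3://example-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT JSON 'noshred';
""")

# Option 2: shred selected keys into relational columns with a JSONPaths file,
# e.g. a file containing {"jsonpaths": ["$.event_id", "$.user.id", "$.timestamp"]}.
cursor.execute("""
    COPY public.events
    FROM 's3://example-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    JSON 's3://example-bucket/config/events_jsonpaths.json';
""")

conn.close()
```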
Automate S3 to Redshift with Fivetran
The right method for moving data from S3 to Redshift depends on the engineering and resource trade-offs your team is willing to make.
- For infrequent bulk loads: The COPY command offers direct control but requires a significant investment in building and maintaining automation.
- For pipelines that need complex, in-flight transformation: AWS Glue provides a managed Spark environment at the cost of developing and managing custom ETL code.
- When the priority is pipeline reliability with minimal operational overhead: An automated ELT platform is the most efficient approach, allowing engineers to focus on building data models instead of managing infrastructure.
Fivetran provides a fully automated S3 connector to move data into Redshift.