How to Extract Data

Every analytics workflow begins with fetching data from the systems where it is generated. Data extraction is the step that retrieves data from those sources and lands it in a data warehouse for ELT/ETL. Done properly, it provides a reliable foundation for accurate decision-making.
This guide explains how extraction works, the different types of data, extraction methods, and best practices.
What is data extraction, and how does it work?
Data extraction is the process of collecting information from multiple sources and consolidating it in a single destination. It’s the first step in the ELT (Extract, Load, and Transform) process. The main purpose of data extraction is to gather all relevant data from disparate sources and move it into a data warehouse or data lake.
At a high level, the extraction process involves identifying the relevant sources, connecting to them, pulling the data, and landing it in the destination.
The ideal extraction technique depends on data structure and source type. The next section covers some of the most common extraction methods.
Common data extraction techniques
Different data structures require different approaches. Below are various extraction methods suited for each data type:
Structured
Structured data is highly organized and follows a predefined format. It is arranged into tables with rows and columns, which makes it easy to extract.
Common examples of structured data extraction include querying relational databases with SQL and exporting tables to CSV files, as in the sketch below.
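As an illustration, here is a minimal sketch of structured extraction: querying a relational table and writing the rows out as a CSV file that a warehouse loader can pick up. It uses Python's built-in sqlite3 module as a stand-in for a production database driver, and the orders table and its columns are assumed.

```python
import csv
import sqlite3

# Connect to the source database (sqlite3 stands in for a production driver
# such as a Postgres or MySQL client; the "orders" table is assumed to exist).
conn = sqlite3.connect("source.db")
cursor = conn.execute("SELECT id, customer_id, amount, updated_at FROM orders")

# Land the rows in a CSV file for loading into the warehouse.
with open("orders_extract.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([col[0] for col in cursor.description])  # header row
    writer.writerows(cursor)                                 # data rows

conn.close()
```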
Semi-structured
Semi-structured data doesn’t reside in a fixed tabular schema, but it does carry organizational markers such as keys, tags, or other structural cues that allow nested or variable arrangements. XML and JSON documents, log files, and HTML web pages are typical examples. Common semi-structured extraction methods include calling REST APIs that return JSON, parsing XML feeds, and scraping HTML pages.
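For example, here is a minimal sketch of semi-structured extraction: parsing a nested JSON payload, such as one returned by an API, and flattening the fields you need into tabular rows. The payload and field names are illustrative.

```python
import json

# A nested JSON payload, as an API might return it (field names are illustrative).
payload = """
{"orders": [
  {"id": 1, "customer": {"name": "Ada", "country": "UK"}, "total": 42.5},
  {"id": 2, "customer": {"name": "Grace", "country": "US"}, "total": 17.0}
]}
"""

records = json.loads(payload)["orders"]

# Flatten the nested structure into rows suitable for a warehouse table.
rows = [
    {
        "order_id": r["id"],
        "customer_name": r["customer"]["name"],
        "customer_country": r["customer"]["country"],
        "total": r["total"],
    }
    for r in records
]

print(rows)
```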
Unstructured
Unstructured data lacks any predefined organization, which makes it the most difficult type to extract. Common methods include text mining and pattern matching, optical character recognition (OCR) for scanned documents, and natural language processing for free-form text.
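As a small illustration, the sketch below uses pattern matching to pull email addresses and dates out of free-form text; real pipelines would typically layer OCR or NLP on top of this kind of approach. The sample text is made up.

```python
import re

# Free-form text, e.g. the body of a support ticket.
text = """
Customer jane.doe@example.com reported the issue on 2024-03-18.
Follow-up scheduled for 2024-03-25; contact support@example.com for details.
"""

# Pattern matching pulls loosely structured fields out of unstructured text.
emails = re.findall(r"[\w.+-]+@[\w.-]+\.[A-Za-z]{2,}", text)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)

print(emails)  # ['jane.doe@example.com', 'support@example.com']
print(dates)   # ['2024-03-18', '2024-03-25']
```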
Common data extraction challenges
Data extraction isn’t a straightforward task and may involve several challenges:
Schema drift
Schema drift refers to changes in the structure of the data in the source system, which can cause the data pipeline to fail when the destination no longer matches.
Schema drift can occur due to:
- Adding new columns to the table
- Renaming a column
- Dropping a column
- Change in data type in existing columns
These changes are common, especially with systems that evolve based on business needs.
To solve this problem, use tools or checks that detect schema modifications and adapt automatically to new columns or data types.
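As an illustration, a lightweight guard against schema drift is to compare each new extract against the schema you expect and flag anything that has appeared, disappeared, or changed type, as in the sketch below. The expected schema and record are assumptions for the example.

```python
# Expected schema for the source table (assumed for this example).
EXPECTED_COLUMNS = {"id": "int", "customer_id": "int", "amount": "float", "updated_at": "str"}

def check_schema(record: dict) -> None:
    """Flag columns that appeared, disappeared, or changed type since the last run."""
    incoming = {name: type(value).__name__ for name, value in record.items()}

    new_cols = set(incoming) - set(EXPECTED_COLUMNS)
    missing_cols = set(EXPECTED_COLUMNS) - set(incoming)
    type_changes = {
        name for name in set(incoming) & set(EXPECTED_COLUMNS)
        if incoming[name] != EXPECTED_COLUMNS[name]
    }

    if new_cols or missing_cols or type_changes:
        # In a real pipeline this would alert or trigger an automatic schema update.
        print(f"Schema drift detected: new={new_cols}, missing={missing_cols}, changed={type_changes}")

# Example: the source added a "currency" column and changed "amount" to a string.
check_schema({"id": 1, "customer_id": 7, "amount": "42.50", "updated_at": "2024-03-18", "currency": "USD"})
```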
Fivetran automatically detects schema changes like new or renamed columns and mapping updates. Learn more ->
Rate limits
Most APIs enforce rate limits to protect the service from abuse, meaning there is a maximum number of requests you can make within a specific period. If you exceed the limit, the extraction is blocked or throttled, leading to incomplete data. APIs might also cap the volume of data per request or require pagination.
This requires strategies for extracting data within the set limits. For example, batching can fetch larger quantities of data with fewer API calls. Also, configure your code to catch 429 responses and pause or slow down the extraction, and have retry logic in place to reattempt failed requests.
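For instance, here is a minimal sketch of rate-limit handling with the requests library: catch 429 responses, honor the Retry-After header when present, and back off before retrying. The API URL is a placeholder.

```python
import time
import requests

def fetch_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    """Fetch a URL, pausing and retrying when the API returns 429 Too Many Requests."""
    delay = 1.0
    for attempt in range(max_retries):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response
        # Honor the server's Retry-After header if present; otherwise back off exponentially.
        wait = float(response.headers.get("Retry-After", delay))
        time.sleep(wait)
        delay *= 2
    raise RuntimeError(f"Giving up on {url} after {max_retries} rate-limited attempts")

# Placeholder endpoint; a real extraction would also paginate through results.
data = fetch_with_backoff("https://api.example.com/v1/orders?page=1").json()
```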
Inconsistent data quality
In most cases, raw data at the source contains missing values, duplicates, or other errors that trickle down into your analytics. During extraction, blank values or malformed data can cause failures or incorrect analysis.
You should incorporate data validation and cleaning into the extraction workflow, either at the source or at the point of extraction. Implement checks for completeness and correctness, and verify that you have the expected data records. Also, handle duplicates in such a way that you don’t accidentally pull the same records more than once.
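As one way to do this, the sketch below drops duplicate records and rejects rows with missing required fields before they reach the warehouse. The required fields and sample records are illustrative.

```python
REQUIRED_FIELDS = ["id", "email", "amount"]  # illustrative required fields

def clean_records(records: list[dict]) -> list[dict]:
    """Drop duplicates (by id) and reject records with missing required fields."""
    seen_ids = set()
    cleaned = []
    for record in records:
        if any(record.get(field) in (None, "") for field in REQUIRED_FIELDS):
            print(f"Rejected incomplete record: {record}")
            continue
        if record["id"] in seen_ids:
            continue  # skip duplicates so the same row isn't loaded twice
        seen_ids.add(record["id"])
        cleaned.append(record)
    return cleaned

rows = clean_records([
    {"id": 1, "email": "a@example.com", "amount": 10.0},
    {"id": 1, "email": "a@example.com", "amount": 10.0},  # duplicate
    {"id": 2, "email": "", "amount": 5.0},                # missing email
])
print(rows)  # only the first record survives
```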
Best practices for data extraction
Below are some best practices to ensure effective data extraction:
Validate data at the source
Check the data you are extracting for quality. It’s easier to prevent the extraction of bad data at the source than fix it later. Use validation rules, such as format checks, range limits, and required fields, to ensure the data is as clean as possible. This reduces the chance of failing jobs due to unexpected values.
Also, integrate quality control measures into your pipeline. Check for out-of-range or null values in critical fields, and compare current and historic data for anomalies. The goal is to enforce data integrity at the earliest stage possible so that other processes run smoothly.
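For example, the validation rules mentioned above might look like the sketch below: a required-field check, a format check, and a range limit applied to each source record before extraction. The rules themselves are illustrative.

```python
import re

def validate(record: dict) -> list[str]:
    """Return a list of validation errors for one source record (rules are illustrative)."""
    errors = []
    # Required field
    if not record.get("order_id"):
        errors.append("order_id is required")
    # Format check: ISO-8601 date
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", str(record.get("order_date", ""))):
        errors.append("order_date must be YYYY-MM-DD")
    # Range limit: amounts should be positive and below an agreed ceiling
    amount = record.get("amount", 0)
    if not (0 < amount <= 1_000_000):
        errors.append("amount out of range")
    return errors

print(validate({"order_id": "A-17", "order_date": "2024-03-18", "amount": 42.5}))  # []
print(validate({"order_date": "18/03/2024", "amount": -5}))  # three errors
```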
Automate incremental extraction
Avoid full data dumps for each run. After the initial full load of the dataset, set up the process to extract only new or changed data on subsequent runs. You can rely on change data capture (CDC) mechanisms or use timestamps to track changes.
Incremental extraction keeps the destination constantly refreshed with new data. It also makes error handling easier: if an ETL job fails, you only have to reprocess the failed increment rather than the entire dataset.
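As an illustration, a simple timestamp-based approach keeps a high-water mark from the previous run and extracts only rows updated since then, as sketched below. The state file, database, and table name are assumptions for the example.

```python
import json
import sqlite3
from pathlib import Path

STATE_FILE = Path("extract_state.json")  # stores the high-water mark between runs

def load_watermark() -> str:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())["last_updated_at"]
    return "1970-01-01T00:00:00"  # first run: fall back to a full load

def extract_incremental() -> list[tuple]:
    watermark = load_watermark()
    conn = sqlite3.connect("source.db")  # stand-in for a production database
    rows = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    conn.close()

    if rows:
        # Record the new high-water mark (in practice, persist it only after
        # the batch has safely landed in the destination).
        STATE_FILE.write_text(json.dumps({"last_updated_at": rows[-1][2]}))
    return rows
```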
Handle errors and logging
Implement robust error handling so you catch the issues that can break your pipeline, typically through validation rules and retry logic.
When extracting from APIs, you can catch errors using try-catch blocks with retries. You can also guard against malformed JSON or unexpected data types by validating the response structure before processing. HTTP status codes (like 403 for permission errors or 429 for exceeded rate limits) and custom error messages help categorize issues.
With logs, you can capture the context of errors, such as the records causing failures, timestamps, and affected data sources. When implementing logging, ensure you capture detailed information about every error and establish clear procedures to resolve errors.
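For instance, the sketch below wraps an API call in a try/except block, validates the response structure, and logs the context needed to debug a failure. The endpoint and the "records" field are placeholders.

```python
import logging
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("extractor")

def extract_page(url: str) -> list[dict]:
    """Fetch one page of records, logging enough context to debug any failure."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        payload = response.json()  # raises ValueError on malformed JSON
        if "records" not in payload:  # validate the response structure before processing
            raise ValueError(f"Unexpected response shape: {list(payload)}")
        return payload["records"]
    except requests.HTTPError as exc:
        # 403s, 429s, etc. are categorized by status code in the log.
        logger.error("HTTP %s while extracting %s: %s", exc.response.status_code, url, exc)
        raise
    except ValueError as exc:
        logger.error("Bad payload from %s: %s", url, exc)
        raise
```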
Monitor and maintain pipelines
Monitor extraction pipelines in real time using metrics such as resource usage, error rates, data throughput, and latency. Track KPIs such as records processed per second or job duration.
Use tools like Prometheus to collect metrics and visualize them using Grafana dashboards or cloud equivalents like CloudWatch or Azure Monitor.
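As a sketch, the Prometheus client library for Python can expose extraction metrics such as record counts, error counts, and job duration for Grafana (or a cloud equivalent) to visualize. The metric names here are illustrative.

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; Prometheus scrapes them and Grafana visualizes them.
RECORDS_EXTRACTED = Counter("records_extracted_total", "Records pulled from the source")
EXTRACTION_ERRORS = Counter("extraction_errors_total", "Failed extraction attempts")
JOB_DURATION = Histogram("extraction_job_seconds", "Duration of each extraction job")

def run_job(extract_fn) -> None:
    """Run one extraction job while recording throughput, errors, and latency."""
    start = time.time()
    try:
        records = extract_fn()
        RECORDS_EXTRACTED.inc(len(records))
    except Exception:
        EXTRACTION_ERRORS.inc()
        raise
    finally:
        JOB_DURATION.observe(time.time() - start)

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
run_job(lambda: [{"id": 1}, {"id": 2}])
```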
Document your extraction process
Define every step of the process, from the data sources and extraction method to validation and transformation. Use data flow diagrams (DFDs) to visually represent how data moves through the extraction system. Also, document the rules and criteria used to validate extracted data, ensuring accuracy.
Why Fivetran is a reliable data extraction tool
Reliable data extraction is crucial for an effective ELT/ETL process. While it can present a wide range of challenges, using the right techniques and applying the best practices above leads to successful extraction.
A practical way to extract data is to use a managed platform like Fivetran. With 700+ connectors, incremental CDC, and automatic adaptation to schema drift, Fivetran runs your pipelines reliably and provides a single dashboard for monitoring sync health, latency, and any errors you encounter.
[CTA_MODULE]
Start for free
Join the thousands of companies using Fivetran to centralize and transform their data.