
The ultimate data wrangling guide

April 8, 2023
Data wrangling is essential for reliable analytics, as it transforms inconsistent inputs into structured data and supports decision-making, forecasting, and machine learning.

Raw data is rarely fit for analysis. Whether from a SaaS application, transactional database, or CSV file, it arrives with missing values, duplicate entries, and inconsistent formatting. Using this data in its raw state produces inaccurate analytics, flawed machine learning models, and poor business decisions.

Wrangling data is notoriously difficult, and it's becoming increasingly crucial for companies investing in AI. In KPMG’s Q3 2025 AI Quarterly Pulse survey, 82% of organizations said data quality issues were their biggest barrier to success.

Mastering these wrangling steps is essential for teams that need reliable, production‑grade data to power analytics and AI at scale.

What is data wrangling?

Data wrangling, also known as data munging, transforms raw data into a structured, usable format. The process involves data structuring, cleaning, enrichment, and validation.

It is the foundational work that turns complex and messy data sets into accurate, reliable information.

Data structuring forces raw, chaotic data into a predictable format. Unstructured data sources like free-form text from support tickets or nested JSON from API calls are unusable for most analytics. This operation imposes a consistent, tabular schema with defined rows and columns, making the data accessible to standard business intelligence tools and machine learning models.

With a consistent structure in place, data cleaning corrects the inevitable errors in the data set. Duplicate records can distort sales forecasts, and inconsistent units of measurement can undermine analysis. Data cleaning eliminates duplicates and corrects errors to ensure accurate calculations.

Analysts also develop strategies for handling missing values, such as removing incomplete records or imputing values based on a statistical model.

Data enrichment enhances the data by integrating it with other relevant sources. An e-commerce company can enhance its sales transaction data by incorporating supplier shipping logs to analyze supply chain efficiency. A bank can enrich loan application data with public credit information to improve risk assessment.

The final step is data validation. A series of automated scripts and queries runs checks to confirm the previous data transformations were successful. This step verifies that the data is structurally sound, internally consistent, and meets all predefined quality standards.

Human expertise remains vital throughout this process. Analysts and data scientists apply domain knowledge and business context to distinguish genuine insights from data anomalies.

The 6 steps of the data wrangling process

Data wrangling follows a logical and iterative sequence of 6 steps. Data analysts and data scientists apply this framework to methodically transform raw data into a validated and trustworthy data set. While the process is sequential, an analyst may loop back to an earlier step to address issues revealed during the workflow.

1. Discovery

Discovery is the initial forensic analysis of the raw data. Before applying any transformations, an analyst must profile the data to understand its structure, content, and quality flaws. This exploration produces a clear diagnosis of the data's condition and informs the entire wrangling strategy.

Key activities include:

  • Profiling the data to understand its basic characteristics, such as the number of rows and columns.
  • Examining the data types of each column to check for inconsistencies.
  • Running summary statistics to understand the distribution of the data, including mean, median, min, max, and distinct values.
  • Identifying patterns, correlations, and outliers that require investigation.

For example, a marketing analyst receives a CSV file of website leads. During discovery, they find the country column contains a mix of full names, two-letter ISO codes, and a high percentage of null values. They also discover that a lead date column is stored as a string in three different formats. This diagnosis provides a clear list of problems to solve in the cleaning step.
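A minimal pandas profiling sketch along these lines, assuming the lead export is a CSV with a country column (the file and column names here are illustrative):

```python
import pandas as pd

# Load the raw lead export (file name is illustrative)
leads = pd.read_csv("website_leads.csv")

# Basic shape and column data types
print(leads.shape)
print(leads.dtypes)

# Summary statistics for every column, numeric or not
print(leads.describe(include="all"))

# Null counts and distinct values surface problem columns like country
print(leads.isna().sum())
print(leads["country"].value_counts(dropna=False).head(20))
```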

2. Structuring

Unstructured and semi-structured data are incompatible with most relational databases and business tools. Data structuring fixes this by reshaping the data into a target schema, making it easier to access, query, and use.

Key activities include:

  • Parsing semi-structured data, such as nested JSON or XML files, to extract relevant attributes.
  • Flattening nested data into a tabular format with consistent rows and columns.
  • Pivoting or unpivoting to align data structures with the requirements of a specific analytical model.
  • Ensuring the entire data set conforms to a single, unified schema.

For example, a data engineer works with event data streamed from an application's API. The raw data arrives as a series of JSON objects with nested user information. The engineer structures this data by parsing each JSON object, flattening the nested user data into separate columns, such as user_id and user_location, and creating a single, wide table where each row represents a single event.
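A minimal sketch of this flattening in pandas, using a couple of hand-built JSON objects in place of the live API stream (field names follow the example above):

```python
import pandas as pd

# Example nested event payloads (structure is illustrative)
events = [
    {"event_id": 1, "event_type": "click",
     "user": {"id": 101, "location": "Berlin"}},
    {"event_id": 2, "event_type": "purchase",
     "user": {"id": 102, "location": "Austin"}},
]

# Flatten nested user fields into top-level columns such as user_id and user_location
flat = pd.json_normalize(events, sep="_")
print(flat.columns.tolist())
# ['event_id', 'event_type', 'user_id', 'user_location']
```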

3. Cleaning

Data cleaning corrects errors and inconsistencies within the dataset. This is often the most time-consuming step, focusing on improving the integrity of individual data points.

Key activities include:

  • Removing duplicate records to avoid inflated counts or incorrect sums.
  • Correcting typos and standardizing categorical data to a common set of values.
  • Identifying and developing a strategy for handling statistical outliers.
  • Imputing, flagging, or deleting records with missing values.

For example, a financial analyst cleans a transactional data set. They find duplicate entries caused by a system glitch, which they remove. They also identify several transactions where the transaction_amount is a negative number, which is impossible. The cleaning script is written to convert these values to their absolute value and flag them for review.
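A short sketch of those two cleaning rules in pandas (the file and column names are assumptions based on the example):

```python
import pandas as pd

transactions = pd.read_csv("transactions.csv")

# Remove exact duplicate rows created by the system glitch
transactions = transactions.drop_duplicates()

# Flag impossible negative amounts for review, then convert them to absolute values
transactions["needs_review"] = transactions["transaction_amount"] < 0
transactions["transaction_amount"] = transactions["transaction_amount"].abs()
```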

4. Enriching

Data enrichment enhances the primary data set by integrating it with information from other sources. This step adds new columns of data that provide more detailed context for analysis.

Key activities include:

  • Joining the primary data set with other data based on a common key.
  • Appending new rows of data from a similar source.
  • Augmenting the data with information from third-party data providers.

For example, a retail company has a data set of customer transactions that includes a product ID and a store location. They enrich this data in two ways. First, they join it with their internal product catalog to add product category and margin information. Then, they join it with an external demographic data set based on the store location to analyze how local population characteristics affect sales.
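A minimal sketch of both joins in pandas, assuming the shared keys are product_id and store_location (the file and column names are illustrative):

```python
import pandas as pd

transactions = pd.read_csv("transactions.csv")        # product_id, store_location, ...
products = pd.read_csv("product_catalog.csv")         # product_id, category, margin
demographics = pd.read_csv("store_demographics.csv")  # store_location, median_income, ...

# Add product category and margin from the internal catalog
enriched = transactions.merge(products, on="product_id", how="left")

# Add local demographic attributes from the external data set
enriched = enriched.merge(demographics, on="store_location", how="left")
```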

5. Validating

Data validation is the quality assurance step of the workflow. After transformations are complete, automated rules and checks verify the correctness and consistency of the data (such as confirming an order date cannot be in the future).

This step serves as a final quality check to prevent flawed data from being included in operational reports and models.

Key activities include:

  • Running data quality rules to check for adherence to business logic.
  • Auditing the data to ensure transformations were applied correctly.
  • Verifying that the final data set is consistent with the source, for example, by checking that total revenue figures match.

For example, after wrangling a large customer database, a data team runs a validation script. The script checks that all email addresses conform to a standard format, that no customer_id is null, and that the total number of records in the new table is within 0.1% of the record count in the source system.
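A minimal validation sketch of those three checks in pandas (the file name, email pattern, and source count are illustrative):

```python
import pandas as pd

customers = pd.read_parquet("customers_wrangled.parquet")
source_count = 1_000_000  # record count reported by the source system

# No null customer IDs
assert customers["customer_id"].notna().all(), "null customer_id found"

# Email addresses conform to a simple format check
email_pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
assert customers["email"].str.match(email_pattern, na=False).all(), "malformed email found"

# Record count is within 0.1% of the source system's count
assert abs(len(customers) - source_count) / source_count <= 0.001, "row count drift exceeds 0.1%"
```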

6. Publishing

Publishing delivers the wrangled and validated data to its final destination. This step makes the clean, reliable data available to the business users, applications, or machine learning models that will consume it. It also includes creating clear documentation for the new data set.

Key activities include:

  • Loading the final data set into production systems, data warehouses, data marts, or data lakes.
  • Documenting the data set, including its schema, a description of each field, and an overview of the transformations applied.
  • Scheduling the data wrangling pipeline to run on a recurring basis to provide consistently fresh data.

For example, the fully wrangled sales data is loaded into a set of production tables in Snowflake. The data engineering team documents these tables in a central data catalog and provides the connection details to the analytics team. These tables now serve as the central, authoritative source for the company's executive-level Power BI dashboards.
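A minimal publishing sketch using pandas and SQLAlchemy, assuming the snowflake-sqlalchemy dialect is installed (the connection URL, file path, and table name are placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

sales = pd.read_parquet("sales_wrangled.parquet")

# The URL format depends on the target warehouse; these values are placeholders
engine = create_engine("snowflake://<user>:<password>@<account>/<database>/<schema>")

# Load the validated table into the production schema, replacing the previous version
sales.to_sql("fct_sales", engine, if_exists="replace", index=False)
```

For large tables, a warehouse-native bulk load is usually a better fit; the sketch only illustrates the final handoff.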

Data wrangling techniques

Within this six-step workflow, teams apply targeted techniques to the data at hand. The specific approach depends on the type of data and the goals of the project.

Handling missing values

A blank cell in a dataset represents an unknown value that can skew analysis and cause machine learning models to fail. An initial analysis of the data's missing values and their frequency is needed to determine the best strategy for resolving them.

Analysts use two primary approaches:

  • Deletion: An analyst may remove records with missing values entirely. This approach is only acceptable when the data set is large and the missing values are few and randomly distributed, ensuring the removal does not introduce bias.
  • Imputation: An analyst makes an educated guess to fill the gap. Common methods include filling with a constant value, such as “0” or “Unknown”, or using a statistical measure, like the mean for normally distributed data, the median for skewed data, or the mode for categorical data.

For example, a sales data set lacks region info for 2% of records. Instead of deleting them, an analyst imputes the value using the most frequent Region for each Salesperson.
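A sketch of that group-wise imputation in pandas (the file and column names are assumptions based on the example):

```python
import pandas as pd

sales = pd.read_csv("sales.csv")  # contains salesperson and region columns

def most_common(values):
    # Most frequent non-null value in the group, or NA if the group has none
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else pd.NA

# Fill missing regions with the most frequent region for each salesperson
most_common_region = sales.groupby("salesperson")["region"].transform(most_common)
sales["region"] = sales["region"].fillna(most_common_region)
```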

Identifying and treating outliers

Outliers are data points that differ significantly from the rest of the dataset, often due to data entry errors or extreme events. Data teams need to identify and address these outliers to avoid skewing statistical measures and obscuring patterns.

Analysts choose from several treatments:

  • Removal: The most straightforward approach involves the analyst deleting the entire record containing the outlier, although this results in information loss.
  • Capping: This treatment assigns a maximum or minimum threshold value to the outlier. Capping prevents the extreme value from skewing the data while preserving the rest of the record.
  • Transformation: This method applies a mathematical function, such as a logarithmic scale. This transformation helps minimize the impact of outliers on the dataset.

For example, a dataset of online order quantities contains an order for 9,000 items, while 99% of orders are for fewer than 10 items. Recognizing this as a data entry error, the analyst caps the Order Quantity at a maximum value of 50.
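A minimal capping sketch in pandas, flagging the affected rows so they can still be reviewed (the file name, column name, and threshold follow the example):

```python
import pandas as pd

orders = pd.read_csv("orders.csv")

# Flag rows that exceed the cap, then clip order_quantity to the threshold
orders["quantity_capped"] = orders["order_quantity"] > 50
orders["order_quantity"] = orders["order_quantity"].clip(upper=50)
```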

Standardization and normalization

Many machine learning algorithms are sensitive to the scale of their input features. A feature with a large range, such as annual income, can overshadow a feature with a small range, like the number of dependents. Scaling techniques prevent this by giving all features a comparable scale.

Two main scaling techniques exist:

  • Normalization (Min-Max scaling) rescales data to a fixed range, usually 0 to 1. Analysts use this for algorithms that do not assume a specific data distribution, such as those used in neural networks.
  • Standardization (Z-score normalization) rescales data to have a mean of 0 and a standard deviation of 1. This is the preferred method for algorithms that assume the data follows a Gaussian distribution, such as linear regression.

For example, an analyst building a customer churn model applies standardization to the "Age" and "Monthly Subscription Fee" columns. This puts both features on a comparable scale, so the model weighs variation in age and fee relative to each feature's spread rather than its raw units.
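A minimal sketch of both scaling methods in plain pandas (the file and column names are illustrative):

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # contains age and monthly_fee columns

for col in ["age", "monthly_fee"]:
    # Standardization (z-score): mean 0, standard deviation 1
    customers[f"{col}_std"] = (customers[col] - customers[col].mean()) / customers[col].std()

    # Normalization (min-max): rescale to the 0-1 range
    col_range = customers[col].max() - customers[col].min()
    customers[f"{col}_norm"] = (customers[col] - customers[col].min()) / col_range
```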

String parsing and manipulation

Raw text fields often contain valuable but unorganized information. String parsing techniques impose a clean, consistent format to extract this information for analysis.

Common operations include:

  • Splitting a single column into multiple columns based on a delimiter.
  • Concatenating several columns into a single column.
  • Using regular expressions (regex) to find and extract specific patterns, such as an email domain or a postal code.
  • Trimming extra whitespace from the beginning or end of a string.

For example, a customer support data set has a ticket_details field that contains John Doe john.doe@example.com. The analyst uses a regular expression to parse this field and create two new columns: customer_name and email_address.
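A sketch of that extraction with a regular expression in pandas; the pattern assumes the simple "name then email" layout shown above and would need hardening for messier real-world text:

```python
import pandas as pd

tickets = pd.DataFrame({"ticket_details": ["John Doe john.doe@example.com"]})

# Capture everything before the email as the name, and the email address itself
pattern = r"^(?P<customer_name>.+?)\s+(?P<email_address>\S+@\S+\.\S+)$"
tickets[["customer_name", "email_address"]] = tickets["ticket_details"].str.extract(pattern)
```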

One-hot encoding for categorical data

Machine learning algorithms require numerical input. They cannot work directly with text-based categories, such as Electronics or Apparel, in a Product Category column.

One-hot encoding converts these text-based categories into a numerical format. The process transforms a single categorical column into multiple new dummy columns, one for each unique category. For a given record, the new column corresponding to its category gets a 1, while all other new columns get a 0.

For example, an analyst converts the Product Category column into three new columns: is_Electronics, is_Apparel, and is_Groceries. A product in the Apparel category receives a 1 in the is_Apparel column and a 0 in the other two.
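A short version of that encoding with pandas (category values follow the example; note that get_dummies sorts the new columns alphabetically):

```python
import pandas as pd

products = pd.DataFrame(
    {"product_category": ["Electronics", "Apparel", "Groceries", "Apparel"]}
)

# One dummy column per category; each row gets a 1 only in its own category's column
encoded = pd.get_dummies(products, columns=["product_category"], prefix="is", dtype=int)
print(encoded.columns.tolist())
# ['is_Apparel', 'is_Electronics', 'is_Groceries']
```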

Binning

Binning, or discretization, converts a continuous numerical variable into a smaller number of discrete bins. This technique can mitigate the impact of minor observational errors and help uncover non-linear relationships in the data.

For example, a data scientist analyzing customer age for a marketing campaign decides that individual ages are too granular. They categorize customers into four groups: 18-29, 30-45, 46-60, and 61+. This allows the marketing team to analyze and improve campaigns by age group.
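A minimal binning sketch with pandas (the ages are made up; the bin edges follow the groups above):

```python
import pandas as pd

customers = pd.DataFrame({"age": [22, 35, 47, 63, 29, 58]})

# Convert continuous ages into the four marketing groups
bins = [17, 29, 45, 60, 120]
labels = ["18-29", "30-45", "46-60", "61+"]
customers["age_group"] = pd.cut(customers["age"], bins=bins, labels=labels)
```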

Data wrangling vs. data cleaning vs. ETL

Data teams often use the terms data wrangling, data cleaning, and ETL to describe different data preparation tasks. Each process differs in its primary goal, scope, and techniques involved.

Data wrangling

This is the broadest of the three terms. It includes data cleaning, but also restructures the data and adds new information to shape raw inputs into a specific format for later use.

For example, an analyst might pivot transactional data into a customer summary, incorporate external demographic data to enrich sales records, or engineer new features for machine learning. Data wrangling proactively reshapes data to serve a particular purpose.

Data cleaning

Data cleaning focuses on fixing mistakes in existing datasets: removing duplicates, correcting formatting issues, and enforcing consistent naming conventions.

ETL pipeline

ETL (Extract, Transform, Load) is a data engineering workflow for moving large data volumes between systems, mainly for consolidation and migration. The transformation step in a traditional ETL pipeline performs basic cleansing to ensure source data matches the destination schema.

| | Data wrangling | Data cleaning | ETL (Extract, Transform, Load) |
| --- | --- | --- | --- |
| Primary goal | Shapes data for a specific analysis or model | Corrects errors and inconsistencies in the data | Moves data from source systems to a target system |
| Scope | Broad; includes cleaning, structuring, and enriching | Narrow; a single component of the wrangling process | Broad; a complete data logistics pipeline |
| Typical user | Data analyst, data scientist | Data analyst, data engineer | Data engineer, IT professional |
| Nature of process | Iterative, exploratory, and ad-hoc | A defined, corrective step in a workflow | Scheduled, automated, and operational |

Modern data pipelines, however, rely on the ELT (Extract, Load, Transform) method. These automated pipelines extract and load raw data from hundreds of sources into a cloud data platform, preserving original data and leveraging cloud processing for transformation.

This shift from ETL to ELT fundamentally changes where and how data preparation happens. Data wrangling is the modern, in-warehouse transformation process. Data analysts and scientists work with fully replicated source data within the cloud platform, applying complex, iterative, and exploratory transformations to prepare it for a specific dashboard, report, or machine learning model.

The ELT pipeline automates data movement, freeing data professionals to focus on high-value data wrangling tasks.

Start your data wrangling with Fivetran

All effective wrangling techniques have one prerequisite: access to the raw data. Slow, manual data collection is the most common bottleneck in any data analytics project, delaying the work before it can even begin. Without automated data movement, even the most skilled data teams are forced to wait for the raw materials they need.

Fivetran automates this foundational step, providing the data pipelines that deliver analysis-ready raw data from all sources into a central platform.
