With the release of Fivetran Transformations, Fivetran is officially a full-fledged ELT tool. With Fivetran Transformations, you can extract, load and transform in the same environment, allowing you to leverage scheduling and version control without having to jury-rig or configure other tools and technologies.
There are two main purposes for transforming data. One is for legibility and organizational coherence, similar to the approach Looker takes to models. The other is to enhance the performance and cost characteristics of your data warehouse.
Likewise, there are two methods for transforming data. The first method, the view, is not stored on disk. Every time someone accesses the view, the system runs the corresponding query to create it. In an ideal world with zero latency and unlimited computational resources, all transformations would simply be views.
Materialized views, on the other hand, are stored on disk because views generated on the fly from a large table or complex query can be extremely non-performant. In fact, the materialized views approach is sometimes necessary to get any performance at all. For instance, Resources exceeded errors on BigQuery can easily crop up if you try to surface tables with hundreds of millions of rows.
The Fivetran transformation layer is an implementation of the materialized views approach. It constructs new tables and updates them on a schedule of your choosing. The interface looks like this:
Note that your SQL script should contain the create or replace table syntax, as well as the name of the schema and the table that the transformation will populate.
An overview of transformations
The transformations you will perform can be sorted into the following categories:
- Cleaning: altering values for formatting compatibility
- Format revision/conversion: replacing incompatible characters, converting units, converting date formatting
- Key restructuring: altering values so that they can be used as keys across tables
- Deduplication: identifying and removing duplicate records.
- Data validation: evaluate the validity of a record by the completeness of the data
- Derivation: performing simple cross-column calculations
- Summarization: using aggregate functions to get summary values
- Pivoting: turning row values into columns
- Sorting or ordering and indexing to improve search performance
- Splitting: splitting a single column into multiple columns — used for fields containing delimited values, or for converting a column with multiple possible categorical values into dummy variables for regression analysis
- Filtering: selecting only certain rows or columns
- Joining: linking data across sources
- Merging/appending/union: combining records from multiple sources
- Integration: reconcile names and values for the same data element across different sources
Suppose you had a table original.activity_table with the following columns: