We previously discussed automated data integration for CSVs, the most common type of ASCII file. Another type of ASCII file, electronic data interchange (EDI), is used for business-to-business communication of transactions such as purchase orders (POs) and medical insurance claims. Unlike CSVs, these files typically do not contain repetitive, same-structure data; instead, they carry an aggregation of multiple types of data in a mixed format. Other file types may also exhibit this behavior, but EDI documents are probably the most common. I will demonstrate how Fivetran and the modern data stack (MDS) can quickly and simply solve this problem.
Most analytical EDI solutions require every document structure to be known and set up for parsing before the data lands in the data warehouse. This post will demonstrate how to load EDI data rapidly into a data warehouse to quickly deliver insights. Moving from a traditional early binding model to a late binding model reduces time to insight, eliminates the majority of ingestion errors and gives the business a process for creating positive outcomes in hours rather than days or weeks.
Purchase order use cases
Each EDI document type is assigned a unique number; a PO, for example, is referenced as an 850 document. The document is essentially a serialized definition of a PO, meaning it contains all of the expected information such as PO number, customer, delivery address and line items, to name a few. A portion of an EDI file is shown below for reference.
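As an illustrative sketch, a simplified 850 fragment looks like the following (the identifiers and values here are invented, and the ISA/GS envelope segments are omitted for brevity):

ST*850*0001~
BEG*00*SA*PO123456**20230401~
N1*ST*EXAMPLE RETAILER*92*0042~
PO1*1*10*EA*9.95**VP*PART-001~
CTT*1~
SE*6*0001~

Each row begins with a segment name (ST, BEG, PO1 and so on), the fields within the row are separated by asterisks (*), and each row ends with a tilde (~).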
EDI documents are exchanged as part of a message sequence, in this case between a buyer and a supplier, that together forms a business transaction. Companies handling either side of a transaction will have invested in various technologies to implement and manage EDI transactions. Replacing this infrastructure is not part of Fivetran’s value proposition. Fivetran simply augments the existing investments by allowing analytical systems to quickly understand the conversations happening within these systems.
Late binding solution process
In the first post, on CSV data ingestion, the binding of the source file data to the target table occurred natively through Fivetran. Here, the late binding occurs in the target database and will be demonstrated with SQL views. In a later post, we will augment this portion with a transformation tool such as dbt™.
Step 1: Ingest raw data
The following image of the Fivetran S3 connector has two changes from the image in Part 1: the “Folder Path” is now “850x12” and the “Advanced Options” value is enabled. The advanced options allow each inbound data row to be treated as a single column of data, which is necessary because there is no EDI item in the file type drop-down list.
Two of the advanced options need to be adjusted to implement late binding (see the image below). Setting the delimiter to a tilde (~), the EDI end-of-line character, causes Fivetran to treat all of the data in a row as a single field, since a tilde never appears within a row of the EDI file. Enabling “Headerless Files” tells Fivetran to import the first row as data.
Within a few minutes of starting the connector, unparsed data will appear in the EDIRAW table (see image below). All data from the file is loaded into the column named COLUMN_0, the default column name created by Fivetran when column headers are unavailable. Within COLUMN_0, the fields of each variable-length row are delimited by asterisks.
Step 2: Parse raw data
The next step is to parse COLUMN_0 of the EDIRAW table into individual columns via a view. The view needs as many columns as the longest row in the raw data has fields. For example, if the longest row has 19 field positions, the view should have 20 columns (the record name plus 19 fields), along with the metadata fields _FILE, _LINE, _MODIFIED and _FIVETRAN_SYNCED (shown below). The following SQL statement creates a view named PARSED_EDI with 20 data columns by splitting COLUMN_0 of the EDIRAW table on the asterisk (*) character.
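A minimal sketch of that statement follows. The C1 through C19 column names are placeholders of our choosing, and only a few of the 20 positions are written out; the pattern repeats for the rest:

CREATE OR REPLACE VIEW PARSED_EDI AS
SELECT
    SPLIT_PART(COLUMN_0, '*', 1)  AS RECORD_NAME,  -- segment name: ISA, BEG, PO1, ...
    SPLIT_PART(COLUMN_0, '*', 2)  AS C1,
    SPLIT_PART(COLUMN_0, '*', 3)  AS C2,
    SPLIT_PART(COLUMN_0, '*', 4)  AS C3,
    -- ...continue the pattern through position 20...
    SPLIT_PART(COLUMN_0, '*', 20) AS C19,
    _FILE,
    _LINE,
    _MODIFIED,
    _FIVETRAN_SYNCED
FROM EDIRAW;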
The syntax shown above is specific to Snowflake.
For Databricks:
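A sketch assuming a recent runtime, where Spark SQL (3.3+) provides a split_part function comparable to Snowflake’s:

CREATE OR REPLACE VIEW PARSED_EDI AS
SELECT
    split_part(COLUMN_0, '*', 1)  AS RECORD_NAME,  -- segment name
    split_part(COLUMN_0, '*', 2)  AS C1,           -- placeholder column names, as above
    split_part(COLUMN_0, '*', 3)  AS C2,
    -- ...continue the pattern through position 20...
    split_part(COLUMN_0, '*', 20) AS C19,
    _FILE,
    _LINE,
    _MODIFIED,
    _FIVETRAN_SYNCED
FROM EDIRAW;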
For Google BigQuery:
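A sketch using array offsets; the edi dataset qualifier is a placeholder, and SAFE_OFFSET returns NULL when a row has fewer fields than the view has columns:

CREATE OR REPLACE VIEW edi.PARSED_EDI AS
SELECT
    SPLIT(COLUMN_0, '*')[SAFE_OFFSET(0)]  AS RECORD_NAME,  -- segment name
    SPLIT(COLUMN_0, '*')[SAFE_OFFSET(1)]  AS C1,           -- placeholder column names, as above
    SPLIT(COLUMN_0, '*')[SAFE_OFFSET(2)]  AS C2,
    -- ...continue the pattern through offset 19...
    SPLIT(COLUMN_0, '*')[SAFE_OFFSET(19)] AS C19,
    _FILE,
    _LINE,
    _MODIFIED,
    _FIVETRAN_SYNCED
FROM edi.EDIRAW;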
For Redshift:
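A sketch; Redshift’s SPLIT_PART takes the same arguments as Snowflake’s:

CREATE OR REPLACE VIEW PARSED_EDI AS
SELECT
    SPLIT_PART(COLUMN_0, '*', 1)  AS RECORD_NAME,  -- segment name
    SPLIT_PART(COLUMN_0, '*', 2)  AS C1,           -- placeholder column names, as above
    SPLIT_PART(COLUMN_0, '*', 3)  AS C2,
    -- ...continue the pattern through position 20...
    SPLIT_PART(COLUMN_0, '*', 20) AS C19,
    _FILE,
    _LINE,
    _MODIFIED,
    _FIVETRAN_SYNCED
FROM EDIRAW;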
In the PARSED_EDI view, each column will contain one of the following: the record segment name (in RECORD_NAME), a field value, or an empty value where a row has fewer fields than the view has columns.
Step 3: Create record segment views
Now we can separate each record segment into a specific view. This is accomplished by creating one view for each value in the PARSED_EDI.RECORD_NAME column. An advantage here is that no platform-specific SQL is required (see the example below).
To keep this simple, no specific type conversions have been added. In a production setting, data typing would be implemented. An example of the BEG segment type is shown below.
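A sketch of such a view, assuming the C1 through C19 placeholder names from the parsing step and the standard X12 element positions for the BEG segment (the view name EDI_850_BEG is ours):

CREATE OR REPLACE VIEW EDI_850_BEG AS
SELECT
    C1 AS TRANSACTION_SET_PURPOSE,  -- BEG01, e.g. 00 = original
    C2 AS PO_TYPE_CODE,             -- BEG02, e.g. SA = stand-alone order
    C3 AS PO_NUMBER,                -- BEG03
    C5 AS PO_DATE,                  -- BEG05 (BEG04, the release number, is often empty)
    _FILE,
    _MODIFIED
FROM PARSED_EDI
WHERE RECORD_NAME = 'BEG';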
As can be seen in the view results, three EDI files have been loaded. Each file has a distinct filename (_FILE) and modified date (_MODIFIED), making it easy to correlate each PO_NUMBER with its source EDI file.
Conclusion
Fivetran can be quickly and easily configured to support complex ASCII files with variable-length rows, and it can deliver that complex data to any supported destination with each data row ingested as a single column. We are then able to deconstruct the complex data through late binding. The advantages of this approach are that all data is persisted in its raw form, the massively parallel processing (MPP) of the cloud data platform (CDP) can be used to deconstruct the complex data into distinct columns, no third-party platforms or programming languages are required, and the data can be analyzed for changes such as unhandled segments. As part of the modern data stack, Fivetran loads all of the data very quickly and lets the MPP CDP do the rest.
For a transactional system, loading every document might be considered an issue, as wrongly formatted documents can cause downstream failures. In an analytics environment, though, the inverse is true: we need 100% of the data to be loaded, errors included, since the errors allow us to further understand issues occurring within the data and the originating systems.
In subsequent installments of this series, we will cover HL7 files as well as how to productionize what we have described using dbt™.