In the previous articles I discussed how to extract ASCII files with fixed (CSV/TSV) and variable (EDI/HL7) formats into a supported destination, choosing the input and load options to suit each scenario. When the process works correctly, data is loaded into the target quickly and efficiently, within a few minutes of gaining access to the source (such as S3 or FTP(S)). For fixed-width files there is the additional advantage that the core types are managed for you.
The final piece of the puzzle is automating the transformation logic in dbt™ so that it can run consistently and easily. This post consists of the following sections:
- Section 1 considers how to run the sample projects in Fivetran
- Section 2 reviews the dbt code and project structure, then explains how to run it within VSCode
- The summary reviews the CSV/EDI/HL7 file parsing series and potential next steps
Section 1 - Running the dbt examples in Fivetran
Access to the appropriate dbt project is based on the target platform; the user just needs to retrieve the relevant project from this GitHub repository:
Please make a copy of the project and store it in your own Git repository, as this will allow you to make personal changes and updates according to your specific requirements. Once a copy has been created, a Fivetran file connector can be configured to read the data. By default the pipeline must create a table called ediraw, as configured in the second blog post here. When connected to GitHub the screen should be similar to the following screenshot. Here the repository is my own instance, used for testing; I copied the public key to GitHub to ensure that Fivetran can connect to the repository.
Selecting “Save & Test” at the bottom of the screen will cause Fivetran to validate the connection to the repository and enable transformations to be read. If successful, the following screen will appear:
At this point a pipeline has to be created that will create a table called EDIRAW in the schema EDITEST_AUG31. If these names are specified incorrectly, it will not be possible to automatically trigger the transformation when new files are loaded.
The final target, at the end of all the transformations, is a table called EDI_DOCUMENTS, which should be configured in the transformation drop-down. When successfully scheduled, the result will appear as follows:
Once successfully executed, dbt generates a dependency diagram that is rendered in Fivetran as shown next. This diagram visually shows the various steps: the source table EDIRAW is parsed into parsed_edi_data, from which each segment is converted into a separate table for column type conversion and renaming where necessary. Once each segment has been parsed, all the tables are combined using a UNION statement into EDI_DOCUMENTS.
Section 2 - Understanding the dbt code and VSCode
Readers may want to access and extend the project, for example to add new segment types or validations. This section provides a quick overview of the dbt project and its key files, and ultimately how to extend and refine the project. The work can be broken into the following steps:
- Update the project to use a different source file, beyond EDITEST_AUG31.EDIRAW.
- Add new segments
- Refine the UNION statements
Prior to all of this, locally install the required Python version, dbt and VSCode; versions are listed here. Using GitHub, access the appropriate project for the desired destination (mentioned earlier in this article).
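As an illustration (the exact versions are listed in the project; the adapter package depends on your destination, and the Snowflake adapter below is just one option):

```sh
# install dbt-core plus the adapter for your destination
# (e.g. dbt-snowflake, dbt-databricks or dbt-bigquery)
pip install dbt-core dbt-snowflake
```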
The following is a general overview of the dbt project files that will need to be altered, depending on the target and scenario:
Model files
There are three types of model files, for ingesting, parsing and modeling the data, corresponding to the SQL discussed in earlier posts. They will differ somewhat for each destination. The general flow is illustrated in the following example, which generates the parsed_edi_data table.
The parsed EDI File
The first line, with the value materialized='table', instructs dbt to create a materialized table rather than just a view. If this line is missing, a view will be created instead.
Instead of selecting from a specific table, dbt allows a logical name to be used via the source() function. The logical name is used in the SQL statement rather than the real table name. A source is a combination of the source name, as declared in the project's sources .yml file, and the table name, in this case ediraw.
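As a minimal sketch of the shape of this model (the full parsing logic was covered in the earlier posts; the raw column name raw_line, the source name edi and the Snowflake-style split_part() call are assumptions for illustration only):

```sql
-- models/parsed_edi_data.sql -- illustrative sketch, not the project's full parsing logic
{{ config(materialized='table') }}

select
    _file,                                           -- bookkeeping columns from the ingest stage
    _line,
    _modified,
    split_part(raw_line, '*', 1) as segment_type,    -- the first element identifies the segment
    split_part(raw_line, '*', 2) as parsed_col_1,
    split_part(raw_line, '*', 3) as parsed_col_2
from {{ source('edi', 'ediraw') }}                   -- 'edi' must be declared in a sources .yml file
```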
Segment mappings
Each supported segment type will have a specific table generated and described in a file. The file is named edi_<segment_id>_segment.sql, so edi_n1_segment.sql is the SQL that processes an N1 segment. A typical segment is shown below.
Line 1 again specifies that a table is to be materialized. The input for each segment is the output of the previous stage, i.e. the table parsed_edi_data. These lines will be the same for every segment.
The following lines are the SELECT statement that extracts the appropriate columns into a new table structure, where the name of each column is specified based on the EDI specification. Here PARSED_COL_1 is mapped to ENTITY_IDENTIFIER_CODE, PARSED_COL_2 to N1_NAME, and so on.
At the end of the SELECT statement is the WHERE clause to select only the appropriate segment rows, in this case N1.
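Pulling these pieces together, a segment model might look roughly like the sketch below (the bookkeeping columns and the use of ref() to reference the upstream model are assumptions based on the flow described above):

```sql
-- models/edi_n1_segment.sql -- illustrative sketch only
{{ config(materialized='table') }}

select
    _file,                                    -- bookkeeping columns carried through
    _line,
    _modified,
    segment_type as _segment_type,
    parsed_col_1 as entity_identifier_code,   -- column names per the X12 N1 specification
    parsed_col_2 as n1_name
from {{ ref('parsed_edi_data') }}
where segment_type = 'N1'                     -- keep only the N1 segment rows
```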
UNION EDI_DOCUMENTS
The final step of all this parsing is a UNION that creates a single wide table: a strongly named version of EDI_PARSED. For this particular scenario I've taken an existing SQL statement and converted it to run in dbt, on the assumption that some readers will be doing this as a first version of their code rather than using some of the more advanced dbt features. (If you'd like to use those features, or rework the example, this article here may be a good starting point.)
EDI_Documents is similar to the previous files, except that there is more than one source table, so each is specified at the start. Each of the segment tables has to be selected and given a logical name, as shown here.
For each source segment the standard SELECT and UNION is defined; shown below is a subset of the ISA_SEGMENT data to be added. Beyond the standard _file, _line, _modified and _segment_type columns, the specific fields AUTHORIZATION_INFORMATION_QUALIFIER through COMPONENT_ELEMENT_SEPARATOR are selected. All other columns are specified but are given a NULL value.
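As a trimmed-down sketch of that pattern (only two segments and a handful of columns are shown; the real model enumerates every segment table and the full column list):

```sql
-- in the style of models/edi_documents.sql -- illustrative sketch only
{{ config(materialized='table') }}

-- ISA rows: keep the ISA-specific fields, pad the N1-specific fields with NULLs
select
    _file,
    _line,
    _modified,
    _segment_type,
    authorization_information_qualifier,
    component_element_separator,
    null as entity_identifier_code,
    null as n1_name
from {{ ref('edi_isa_segment') }}

union all

-- N1 rows: the reverse padding
select
    _file,
    _line,
    _modified,
    _segment_type,
    null as authorization_information_qualifier,
    null as component_element_separator,
    entity_identifier_code,
    n1_name
from {{ ref('edi_n1_segment') }}
```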
Again, this is not the most elegant approach in dbt, but it demonstrates how to quickly convert existing SQL code.
VSCode
VSCode can be used to run, modify and test the dbt project in a standard developer environment without having to configure Fivetran. Once the project has been loaded into the environment and the appropriate environment variables, such as DBT_DATABRICKS_PASSWORD, have been set (see profiles.yml for this information), everything is good to run.
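For example (assuming a Databricks target; check profiles.yml in the project for the exact variable names your target expects):

```sh
# set the credential the profile reads; the value shown is a placeholder
export DBT_DATABRICKS_PASSWORD="<your-access-token>"
dbt debug   # optional: verifies the profile and connection before running the models
```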
Prior to running, the source files must already be loaded into and available in the specified warehouse or data lake. To run the model, issue the following command with the appropriate target, i.e. databricks, google or snowflake; examples are shown in the Readme for the project.
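For instance (the target name must match one defined in the project's profiles.yml):

```sh
dbt run --target databricks
```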
When successful, the output will be as follows:
Note that dbt determines the execution order based on the dependencies between the transformations and the tables and views they create.
Data file
Any EDI-compliant file can be processed with this pattern; a sample called 850_X12_4010.txt has been provided in the GitHub project. For testing purposes I made copies of, and changes to, this file to validate that the pattern works.
Summary
This set of blog posts has guided the reader through techniques that are directly supported by Fivetran, such as the ingestion of CSV files. We then looked at how Fivetran's ELT capabilities can support extended use cases and file types such as EDI and HL7, with the transformations finally developed in dbt. Going forward, anyone wanting to ingest EDI files has a starting point to quickly load, parse and place the data in a target table ready for consumption.
The samples provided here are for demonstration purposes only; they are not meant for production, and no warranty or support is implied. It is also acknowledged that further work could be done to make use of other Fivetran and dbt functionality, such as:
- Error and warning handling - e.g. if an unexpected segment type is found, or a segment has more columns than expected. This could easily be added, but is out of scope for this series.
- Orchestrating the merging of EDI data with data from other systems, such as SAP to create a supply chain 360 view.
- Updating the UNION EDI_DOCUMENTS file to use native dbt processing, to reduce the amount of SQL to maintain.
Good luck, and if you have any questions please do not hesitate to reach out via email at [email protected].