In the previous articles I discussed how to extract ASCII files with fixed (CSV/TSV) and variable (EDI/HL7) formats into a supported destination, choosing the input and load options to suit each scenario. When the process works correctly, data is loaded into the target quickly and efficiently, within a few minutes of gaining access to the source (such as S3 or FTP(S)). For fixed-width files there is the additional advantage that the core types are managed for you.
The final piece of the puzzle is automating the transformation logic in dbt™ so that it can run consistently and easily. This post consists of the following sections:
- Section 1 considers how to run the sample projects in Fivetran
- Section 2 reviews the dbt code and project structure, then explains how to run it within VSCode
- The summary reviews the CSV/EDI/HL7 file parsing series and potential next steps
Section 1 - Running the dbt examples in Fivetran
Access to the appropriate dbt project is based on the target platform; the user just needs to retrieve the relevant project from this GitHub repository:
Please make a copy of the project and store it in your own Git repository, as this will allow you to make personal changes and updates according to your specific requirements. Once a copy has been created, a Fivetran file connector can be configured to read the data. By default the pipeline must create a table called ediraw, as configured in the second blog post here. When connected to GitHub the screen should be similar to the following screenshot. Here the repository is my own instance, used for testing; I copied the public key to GitHub to ensure that Fivetran can connect to the repository.
Selecting “Save & Test” at the bottom of the screen will cause Fivetran to validate the connection to the repository and enable transformations to be read. If successful, the following screen will appear:
At this point a pipeline has to be created that will create a table called EDIRAW in the schema EDITEST_AUG31. If these names are specified incorrectly, it will not be possible to automatically trigger the transformation when new files are loaded.
The final target, at the end of all the transformations, is a table called EDI_DOCUMENTS, which should be configured in the transformation drop-down. When successfully scheduled, the result will appear as follows:
Once successfully executed, dbt generates a dependency diagram that is rendered in Fivetran as shown next. This diagram visually shows the various steps: the source table EDIRAW is parsed into parsed_edi_data, from which each segment is converted into a separate table for column type conversion and renaming where necessary. Once each segment has been parsed, all the tables are combined using a UNION statement into EDI_DOCUMENTS.
Section 2 - Understanding the dbt code and VSCode
Readers may want to access and extend the project, for example to add new segment types or validations. This section provides a quick overview of the dbt project and its key files, and ultimately how to extend and refine the project. The work can be broken into the following steps:
- Update the project to use a different source file, beyond EDITEST_AUG31.EDIRAW.
- Add new segments
- Refine the UNION statements
Prior to all of this, locally install the required Python version, dbt and VSCode; versions are listed here. Using GitHub, access the appropriate project for the desired destination (mentioned earlier in this article).
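As an illustration (the exact versions are listed in the project; the adapter package depends on your destination, and the Snowflake adapter below is just one option):

```sh
# install dbt-core plus the adapter for your destination
# (e.g. dbt-snowflake, dbt-databricks or dbt-bigquery)
pip install dbt-core dbt-snowflake
```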
The following is a general overview of the dbt project files that will need to be altered, depending on the target and scenario:
Model files
There are three types of model files, for ingesting, parsing and modeling the data, corresponding to the SQL discussed in earlier posts. They will differ somewhat for each destination. The general flow is illustrated in the following example, which generates the parsed_edi_data table.
The parsed EDI File
The first line, with the value materialized='table', instructs dbt to create a materialized table rather than just a view. If this line is missing, a view will be created instead.
Instead of selecting from a specific table, dbt allows a logical name to be used via the source() function. The logical name is used in the SQL statement rather than the real table name. A source is a combination of the source name, as declared in the project's sources .yml file, and the table name, in this case ediraw.
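As a minimal sketch of the shape of this model (the full parsing logic was covered in the earlier posts; the raw column name raw_line, the source name edi and the Snowflake-style split_part() call are assumptions for illustration only):

```sql
-- models/parsed_edi_data.sql -- illustrative sketch, not the project's full parsing logic
{{ config(materialized='table') }}

select
    _file,                                           -- bookkeeping columns from the ingest stage
    _line,
    _modified,
    split_part(raw_line, '*', 1) as segment_type,    -- the first element identifies the segment
    split_part(raw_line, '*', 2) as parsed_col_1,
    split_part(raw_line, '*', 3) as parsed_col_2
from {{ source('edi', 'ediraw') }}                   -- 'edi' must be declared in a sources .yml file
```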
Segment mappings
Each supported segment type will have a specific table generated and described in a file. The file is named edi_<segment_id>_segment.sql, so edi_n1_segment.sql is the SQL that processes an N1 segment. A typical segment is shown below.
Line 1 again specifies that a table is to be materialized. The input for each segment is the output of the previous stage, i.e. the table parsed_edi_data. These lines will be the same for every segment.
The following lines are the SELECT statement that extracts the appropriate columns into a new table structure, where the name of each column is specified based on the EDI specification. Here PARSED_COL_1 is mapped to ENTITY_IDENTIFIER_CODE, PARSED_COL_2 to N1_NAME, and so on.
At the end of the SELECT statement is the WHERE clause to select only the appropriate segment rows, in this case N1.
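Pulling these pieces together, a segment model might look roughly like the sketch below (the bookkeeping columns and the use of ref() to reference the upstream model are assumptions based on the flow described above):

```sql
-- models/edi_n1_segment.sql -- illustrative sketch only
{{ config(materialized='table') }}

select
    _file,                                    -- bookkeeping columns carried through
    _line,
    _modified,
    segment_type as _segment_type,
    parsed_col_1 as entity_identifier_code,   -- column names per the X12 N1 specification
    parsed_col_2 as n1_name
from {{ ref('parsed_edi_data') }}
where segment_type = 'N1'                     -- keep only the N1 segment rows
```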
UNION EDI_DOCUMENTS
The final step of all this parsing is a UNION that creates a single wide table: a strongly named version of EDI_PARSED. For this particular scenario I've taken an existing SQL statement and converted it to run in dbt, on the assumption that some readers will be doing this as a first version of their code rather than using some of the more advanced dbt features. (If you'd like to use those features, or rework the example, this article here may be a good starting point.)
EDI_Documents is similar to the previous files, except that there is more than one source table, so each is specified at the start. Each of the segment tables has to be selected and given a logical name, as shown here.
For each source segment the standard SELECT and UNION is defined; shown below is a subset of the ISA_SEGMENT data to be added. Beyond the standard _file, _line, _modified and _segment_type columns, the specific fields AUTHORIZATION_INFORMATION_QUALIFIER through COMPONENT_ELEMENT_SEPARATOR are selected. All other columns are specified but are given a NULL value.
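As a trimmed-down sketch of that pattern (only two segments and a handful of columns are shown; the real model enumerates every segment table and the full column list):

```sql
-- in the style of models/edi_documents.sql -- illustrative sketch only
{{ config(materialized='table') }}

-- ISA rows: keep the ISA-specific fields, pad the N1-specific fields with NULLs
select
    _file,
    _line,
    _modified,
    _segment_type,
    authorization_information_qualifier,
    component_element_separator,
    null as entity_identifier_code,
    null as n1_name
from {{ ref('edi_isa_segment') }}

union all

-- N1 rows: the reverse padding
select
    _file,
    _line,
    _modified,
    _segment_type,
    null as authorization_information_qualifier,
    null as component_element_separator,
    entity_identifier_code,
    n1_name
from {{ ref('edi_n1_segment') }}
```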
Again, this is not the most elegant approach in dbt, but it demonstrates how to quickly convert existing SQL code.
VSCode
VSCode can be used to run, modify and test the dbt project in a standard developer environment without having to configure Fivetran. Once the project has been loaded into the environment and the appropriate environment variables, such as DBT_DATABRICKS_PASSWORD, have been set (see profiles.yml for this information), everything is good to run.
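For example (assuming a Databricks target; check profiles.yml in the project for the exact variable names your target expects):

```sh
# set the credential the profile reads; the value shown is a placeholder
export DBT_DATABRICKS_PASSWORD="<your-access-token>"
dbt debug   # optional: verifies the profile and connection before running the models
```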
Prior to running, the source files must already be loaded into and available in the specified warehouse or data lake. To run the model, issue the following command with the appropriate target, i.e. databricks, google or snowflake; examples are shown in the Readme for the project.
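For instance (the target name must match one defined in the project's profiles.yml):

```sh
dbt run --target databricks
```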
When successful, the output will be as follows:
Note that dbt determines the execution order based on the dependencies between the transformations and the tables and views they create.
Data file
Any EDI-compliant file can be processed with this pattern; a sample called 850_X12_4010.txt has been provided in the GitHub project. For testing purposes I made copies of, and changes to, this file to validate that the pattern works.
Summary
This set of blog posts has guided the reader through techniques that are directly supported by Fivetran, such as the ingestion of CSV files. We then looked at how Fivetran's ELT capabilities can support extended use cases and file types such as EDI and HL7, with the transformations finally developed in dbt. Going forward, anyone wanting to ingest EDI files has a starting point to quickly load, parse and place the data in a target table ready for consumption.
The samples provided here are for demonstration purposes only; they are not meant for production, and no warranty or support is implied. It is also acknowledged that further work could be done to make use of other Fivetran and dbt functionality, such as:
- Error and warning handling - e.g. if an unexpected segment type is found, or a segment has more columns than expected. This could easily be added, but is out of scope for this series.
- Orchestrating the merging of EDI data with data from other systems, such as SAP to create a supply chain 360 view.
- Updating the UNION EDI_DOCUMENTS file to use native dbt processing, to reduce the amount of SQL to maintain.
Good luck, and if you have any questions please do not hesitate to reach out via email at [email protected].