We previously discussed automated data integration for the most common type of ASCII file, CSVs, as well as EDI files. HL7 documents share many of the characteristics of EDI but add an extra twist: records in an HL7 file may be nested, so they require a second level of parsing rather than a single one. HL7 documents allow patient and visit information to be reliably exchanged between parties such as hospitals and insurance companies. A single HL7 document can contain a combination of embedded data objects, or records, such as patient visit information, medications administered and surgeries performed.
Differences between EDI and HL7
Below is a sample of an HL7 file. One noticeable difference is the pipe field delimiter instead of the asterisk. This difference is easily handled in the database by altering the parsing process. The second, and much larger, issue is that there is no record delimiter, whereas EDI files used a tilde.
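For illustration, a short synthetic fragment containing MSH, PID, NK1 and PV1 segments might look like this (the values are made up for readability):

MSH|^~\&|REGADT|HOSPITAL|RECEIVER|CLINIC|202301011230||ADT^01|MSG00001|P|2.3
PID|1||123456||DOE^JOHN||19800101|M
NK1|1|DOE^JANE|SPO
PV1|1|I|2000^2012^01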
Some changes are required in the Fivetran S3 connector configuration to load this data from an S3 bucket into a data warehouse. In the Advanced Options section, an asterisk is chosen as the Delimiter. Because the selected delimiter does not appear in the file, Fivetran does not break the data into individual fields. Also, since there are no column headers, the Headerless Files option is selected.
Upon loading, Fivetran creates a landing table, here named INITIAL. All the data from the file is found in a single column, COLUMN_0. We will parse this column in the next step.
The INITIAL table is parsed by creating a view called PARSED_HL7. The SQL used to generate the PARSED_HL7 view is shown below (here, in the dialect appropriate to Snowflake; later on, we will include some adjustments for other platforms). Note that the HL7 records used in this example can contain up to 52 fields, so columns named COL1 through COL52 have been created and the values are extracted using the pipe delimiter.
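A minimal sketch of what this view might look like is shown here. It is abbreviated to a handful of the 52 columns, and naming the first field RECORD_NAME rather than COL1 is an assumption made to match the discussion that follows:

CREATE OR REPLACE VIEW PARSED_HL7 AS
SELECT
    -- The first field of every HL7 segment is the segment (record) name, e.g. MSH, PID, PV1
    SPLIT_PART(COLUMN_0, '|', 1)  AS RECORD_NAME,
    SPLIT_PART(COLUMN_0, '|', 2)  AS COL2,
    SPLIT_PART(COLUMN_0, '|', 3)  AS COL3,
    SPLIT_PART(COLUMN_0, '|', 4)  AS COL4,
    -- ... the same pattern repeats for each remaining field ...
    SPLIT_PART(COLUMN_0, '|', 52) AS COL52
FROM INITIAL;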
Parsing COLUMN_0 on other CDPs follows the same approach as on Snowflake: the statement is simply updated with each platform's string functions, as shown below.
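As a sketch, covering only the first field and an illustrative rather than exhaustive list of platforms, the Snowflake SPLIT_PART call could be replaced as follows:

-- Snowflake / Redshift: SPLIT_PART takes a literal delimiter
SPLIT_PART(COLUMN_0, '|', 1) AS RECORD_NAME

-- BigQuery: SPLIT returns an array, indexed from zero
SPLIT(COLUMN_0, '|')[SAFE_OFFSET(0)] AS RECORD_NAME

-- Databricks (Spark SQL): split takes a regular expression, so the pipe must be escaped
element_at(split(COLUMN_0, '\\|'), 1) AS RECORD_NAME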
PARSED_HL7 is then used to derive the appropriate segment types. The column RECORD_NAME identifies the segment. For this example, the MSH segment will be used to create a segment-specific view, as it is a common segment and contains a subfield that needs further parsing.
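A sketch of such a segment view is shown here. The view name HL7_MSH, the fields selected and their column names are illustrative assumptions; the positions follow the standard MSH layout after splitting on the pipe:

CREATE OR REPLACE VIEW HL7_MSH AS
SELECT
    COL2 AS ENCODING_CHARACTERS,    -- e.g. ^~\&
    COL3 AS SENDING_APPLICATION,
    COL4 AS SENDING_FACILITY,
    COL5 AS RECEIVING_APPLICATION,
    COL6 AS RECEIVING_FACILITY,
    COL7 AS MESSAGE_DATETIME,
    COL9 AS MESSAGE_TYPE            -- nested value such as ADT^01, parsed further below
FROM PARSED_HL7
WHERE RECORD_NAME = 'MSH';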
The structure of the MSH segment is defined using the above SQL CREATE VIEW statement. The parsing of the main record fields by pipe has already been performed in the underlying view, so defining the segment is standard SQL column mapping. The result is shown below:
An interesting field in the record is MESSAGE_TYPE, as it contains a nested record structure of main and sub message types. This is visible in the following screenshot in the values ADT^01 and ADT^04, where the caret (^) symbol is used to delimit the two components.
MESSAGE_TYPE needs to be parsed. This is done using the same pattern as in the above example by creating a new view called HL7_MSH_MESG_TYPE. The SQL for Snowflake would be:
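A minimal sketch, assuming the HL7_MSH view defined above and using illustrative names for the derived columns:

CREATE OR REPLACE VIEW HL7_MSH_MESG_TYPE AS
SELECT
    MESSAGE_TYPE,
    SPLIT_PART(MESSAGE_TYPE, '^', 1) AS MESSAGE_TYPE_MAIN,  -- e.g. ADT
    SPLIT_PART(MESSAGE_TYPE, '^', 2) AS MESSAGE_TYPE_SUB    -- e.g. 01
FROM HL7_MSH;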
This statement will need to be modified for other CDPs:
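As a sketch, using the same assumed column names, the caret split could be written on other platforms as:

-- BigQuery: literal delimiter, zero-based array index
SPLIT(MESSAGE_TYPE, '^')[SAFE_OFFSET(0)] AS MESSAGE_TYPE_MAIN,
SPLIT(MESSAGE_TYPE, '^')[SAFE_OFFSET(1)] AS MESSAGE_TYPE_SUB

-- Databricks (Spark SQL): split takes a regular expression, so the caret must be escaped
element_at(split(MESSAGE_TYPE, '\\^'), 1) AS MESSAGE_TYPE_MAIN,
element_at(split(MESSAGE_TYPE, '\\^'), 2) AS MESSAGE_TYPE_SUB

-- Redshift: SPLIT_PART works as in Snowflake
SPLIT_PART(MESSAGE_TYPE, '^', 1) AS MESSAGE_TYPE_MAIN,
SPLIT_PART(MESSAGE_TYPE, '^', 2) AS MESSAGE_TYPE_SUB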
The result is shown below.
ETL vs ELT
Early binding solutions, such as ETL tools, can reject data as it is loaded into the database. Late binding solutions allow data to be parsed once it is in the CDP landing zone, which is especially important in analytic and innovation style use cases. In the previous screenshot of HL7_MSH_MESG_TYPE, note that no other segment types, such as PID, NK1 and PV1, are listed. This is another advantage of late binding: all of the data is still kept in the INITIAL table and can be extracted as needed. The upshot is that data can be loaded with minimal impact, and any breakages can be researched and corrected within the CDP.
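For instance, if patient demographics are needed later, a PID view can be derived from the already-loaded data without reloading anything. The sketch below assumes the PARSED_HL7 column numbering from earlier and maps a few standard PID field positions:

CREATE OR REPLACE VIEW HL7_PID AS
SELECT
    COL4 AS PATIENT_ID,     -- PID-3 Patient Identifier List
    COL6 AS PATIENT_NAME,   -- PID-5 Patient Name
    COL8 AS DATE_OF_BIRTH   -- PID-7 Date/Time of Birth
FROM PARSED_HL7
WHERE RECORD_NAME = 'PID';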
Fivetran’s ASCII file ingestion connectors (FTP, S3, Google Drive, etc.) provide simple and speedy solutions to assist you in processing complex file types such as EDI and HL7 by harnessing the power of the CDP and reinforcing the ELT paradigm in your Modern Data Stack.
In the next and final installment of this series we will discuss how to productionize the SQL we have written for ASCII files using dbt™.