Automated data integration for HL7 files

A continuation of our series on ASCII files: how to ingest data from HL7 files
May 26, 2023

We previously discussed automated data integration for the most common type of ASCII file, CSVs, as well as EDI files. HL7 documents have many of the characteristics of EDI but add an extra twist: records in an HL7 file may be nested, so instead of a single level of parsing they require a second level. HL7 documents allow patient and visit information to be reliably exchanged between parties such as hospitals and insurance companies. A single HL7 document can contain a combination of embedded data objects or records, such as patient visit information, medications administered and surgeries performed.

Differences between EDI and HL7

Below is a sample of an HL7 file. One noticeable difference is the pipe (|) field delimiter instead of the asterisk used by EDI. This difference is easily handled in the database by altering the parsing process. The second, and much larger, issue is that there is no record delimiter character like the tilde used in EDI files; each HL7 segment simply appears on its own line.

MSH|^~\&|ADT1|MCM|LABADT|MCM|198808181126|SECURITY|ADT^A01|MSG00001-|P|2.6
EVN|A01|198808181123
PID|||PATID1234^5^M11^^AN||JONES^WILLIAM^A^III||19610615|M||2106-3|677 DELAWARE AVENUE^^EVERETT^MA^02149|GL|(919)379-1212|(919)271-3434~(919)277-3114||S||PATID12345001^2^M10^^ACSN|123456789|9-87654^NC
NK1|1|JONES^BARBARA^K|SPO|||||20011105
NK1|1|JONES^MICHAEL^A|FTH
PV1|1|I|2000^2012^01||||004777^LEBAUER^SIDNEY^J.|||SUR||-||ADM|A0
AL1|1||^PENICILLIN||CODE16~CODE17~CODE18
AL1|2||^CAT DANDER||CODE257
DG1|001|I9|1550|MAL NEO LIVER, PRIMARY|19880501103005|F
PR1|2234|M11|111^CODE151|COMMON PROCEDURES|198809081123
ROL|45^RECORDER^ROLE MASTER LIST|AD|RO|KATE^SMITH^ELLEN|199505011201
GT1|1122|1519|BILL^GATES^A
IN1|001|A357|1234|BCMD|||||132987
IN2|ID1551001|123456789
ROL|45^RECORDER^ROLE MASTER LIST|AD|RO|KATE^ELLEN|199505011201

Some changes are required in the Fivetran S3 connector configuration to load this data from an S3 bucket into a data warehouse. In the Advanced Options section, an asterisk is chosen as the Delimiter. Because that character does not appear in the file, Fivetran will not break each line into individual fields. Also, since there are no column headers, the Headerless Files option is selected.

Upon loading, Fivetran creates a landing table, here named INITIAL. All the data from the file is found in a single column, COLUMN_0. We will parse this column in the next step.
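Before any parsing, the raw segments can be inspected directly in the landing table, for example with an illustrative query like the one below (using the _FILE and _LINE metadata columns that also appear in the view later on):

SELECT _FILE, _LINE, COLUMN_0
FROM INITIAL
ORDER BY _FILE, _LINE
LIMIT 10;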

The INITIAL table is parsed by creating a view called PARSED1_HL7. The SQL used to generate the PARSED1_HL7 view is shown below (here, in the dialect appropriate to Snowflake – later on, we will include some adjustments for other platforms!). Note that the HL7 segments in this example can contain many fields, so columns named COL1 through COL52 have been created and the values are extracted using the pipe delimiter.

create or replace view DAVID_MILLMAN.HL7.PARSED1_HL7(	
	_FILE,	
	_MODIFIED,	
	_LINE,	
	RECORD_NAME,
	COL1, COL2, COL3, COL4, COL5, COL6, COL7, COL8, COL9, COL10, COL11, COL12, COL13,
	COL14, COL15, COL16, COL17, COL18, COL19, COL20, COL21, COL22, COL23, COL24, COL25, COL26,
	COL27, COL28, COL29, COL30, COL31, COL32, COL33, COL34, COL35, COL36, COL37, COL38, COL39,
	COL40, COL41, COL42, COL43, COL44, COL45, COL46, COL47, COL48, COL49, COL50, COL51, COL52
) as 
SELECT _FILE, _MODIFIED, _LINE,
  -- Split each raw segment on the pipe delimiter: element 0 is the segment (record) name
  -- and elements 1 through 52 map to COL1-COL52; REPLACE strips any stray double quotes.
  REPLACE(get(SPLIT(COLUMN_0, '|'), 0), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 1), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 2), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 3), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 4), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 5), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 6), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 7), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 8), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 9), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 10), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 11), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 12), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 13), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 14), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 15), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 16), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 17), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 18), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 19), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 20), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 21), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 22), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 23), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 24), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 25), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 26), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 27), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 28), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 29), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 30), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 31), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 32), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 33), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 34), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 35), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 36), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 37), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 38), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 39), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 40), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 41), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 42), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 43), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 44), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 45), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 46), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 47), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 48), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 49), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 50), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 51), '"'),
  REPLACE(get(SPLIT(COLUMN_0, '|'), 52), '"')

FROM INITIAL;

Parsing COLUMN_0 for other CDPs, alongside Snowflake, is shown below. Each statement extracts the first pipe-delimited field; the parsing is performed by updating the column expressions in the view with the appropriate string functions.

Statement                                     Platform
element_at(SPLIT(COLUMN_0, '[|]'), 1)         Databricks
SPLIT(COLUMN_0, '|')[SAFE_OFFSET(0)]          Google BigQuery
REPLACE(get(SPLIT(COLUMN_0, '|'), 0), '"')    Snowflake
SPLIT_PART(COLUMN_0, '|', 1)                  Redshift
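A quick way to confirm the parse is to count rows per segment type. An illustrative Snowflake query (actual counts depend on the files loaded):

SELECT RECORD_NAME, COUNT(*) AS SEGMENT_COUNT
FROM PARSED1_HL7
GROUP BY RECORD_NAME
ORDER BY SEGMENT_COUNT DESC;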

PARSED1_HL7 is then used to derive the appropriate segment types, with the RECORD_NAME column specifying the segment. For this example, the MSH segment will be used to create a specific segment view, as it is a common segment and contains a subfield that needs further parsing.
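A minimal sketch of such a segment view might look like the following. The mapping of positional columns to names (for example, COL8 as MESSAGE_TYPE) is inferred from the sample MSH segment above and should be verified against your own HL7 feed:

create or replace view DAVID_MILLMAN.HL7.HL7_SEGMENT_MSH(
	_FILE,
	_MODIFIED,
	_LINE,
	RECORD_NAME,
	SENDING_APPLICATION,
	SENDING_FACILITY,
	MESSAGE_DATETIME,
	MESSAGE_TYPE,
	MESSAGE_CONTROL_ID
) as
SELECT _FILE, _MODIFIED, _LINE, RECORD_NAME,
	COL2, COL3, COL6, COL8, COL9
FROM PARSED1_HL7
WHERE RECORD_NAME = 'MSH';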

The structure of the MSH segment is defined using the same SQL create view pattern shown above. Parsing of the main record by pipe has already been performed by the PARSED1_HL7 view, so the segment view only needs standard SQL column mapping. The result is shown below:

An interesting part of the record is the MESSAGE_TYPE field, as it contains a nested record structure of main and sub message types. This is denoted in the following screenshot by values such as ADT^A01 and ADT^A04, where the caret (^) is used to delimit the two components.

MESSAGE_TYPE needs to be parsed. This is done using the same pattern as in the above example by creating a new view called HL7_MSH_MESG_TYPE. The SQL for Snowflake would be:

create or replace view DAVID_MILLMAN.HL7.HL7_MSH_MESG_TYPE(
	_FILE,
	_MODIFIED,
	_LINE,
	RECORD_NAME,
	MAIN_MSG_TYPE,
	SUB_MSG_TYPE
) as
SELECT _FILE, _MODIFIED, _LINE, RECORD_NAME,
	REPLACE(get(SPLIT(MESSAGE_TYPE, '^'), 0), '"'),
	REPLACE(get(SPLIT(MESSAGE_TYPE, '^'), 1), '"')
FROM HL7_SEGMENT_MSH;

This statement will need to be modified for other CDPs:

Statement                                         Platform
element_at(SPLIT(MESSAGE_TYPE, '\\^'), 1)         Databricks
SPLIT(MESSAGE_TYPE, '^')[SAFE_OFFSET(0)]          Google BigQuery
REPLACE(get(SPLIT(MESSAGE_TYPE, '^'), 0), '"')    Snowflake
SPLIT_PART(MESSAGE_TYPE, '^', 1)                  Redshift
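As an example of adapting the full statement, a BigQuery version of the same view might look like the following sketch (the project and dataset names are placeholders, and the metadata column names follow the Snowflake example):

create or replace view `my_project.hl7.hl7_msh_mesg_type` as
SELECT
  _FILE,
  _MODIFIED,
  _LINE,
  RECORD_NAME,
  SPLIT(MESSAGE_TYPE, '^')[SAFE_OFFSET(0)] AS MAIN_MSG_TYPE,
  SPLIT(MESSAGE_TYPE, '^')[SAFE_OFFSET(1)] AS SUB_MSG_TYPE
FROM `my_project.hl7.hl7_segment_msh`;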

The result is shown below.
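For example, main and sub message types can be summarized from the new view with an illustrative query like this (the output depends on the messages loaded):

SELECT MAIN_MSG_TYPE, SUB_MSG_TYPE, COUNT(*) AS MESSAGE_COUNT
FROM HL7_MSH_MESG_TYPE
GROUP BY MAIN_MSG_TYPE, SUB_MSG_TYPE;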

ETL vs ELT

Early binding solutions, such as ETL tools, reject data as it is loaded into the database. Late binding solutions allow data to be parsed once it is in the CDP landing zone, which is especially important in analytic and innovation style use cases. In the previous screenshot of HL7_MSH_MESG_TYPE, note that no other segment types, such as PID, NK1 and PV1, are listed. This is another advantage of late binding: all of the data is still kept in the INITIAL table and can be extracted as needed. The upshot is that data can be loaded with minimal impact, and breakages can be researched and corrected within the CDP.
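For example, if PID segments later become relevant, they can be derived from the same landing data without reloading anything. A sketch (the column choices are inferred from the sample PID segment and are only illustrative):

create or replace view DAVID_MILLMAN.HL7.HL7_SEGMENT_PID as
SELECT _FILE, _MODIFIED, _LINE, RECORD_NAME,
	COL3 AS PATIENT_ID,
	COL5 AS PATIENT_NAME,
	COL7 AS DATE_OF_BIRTH
FROM PARSED1_HL7
WHERE RECORD_NAME = 'PID';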

Fivetran’s ASCII file ingestion connectors (FTP, S3, Google Drive, etc.) provide simple and speedy solutions for processing complex file types such as EDI and HL7 by harnessing the power of the CDP and reinforcing the ELT paradigm in your Modern Data Stack.

In the next and final installment of this series we will discuss how to productionize the SQL we have written for ASCII files using dbt™.
