HL7 v2X and v3 data modeling

Question

The company I work for has started a new initiative in HL7 where we are trading both v2X and v3 (CDA specifically) messages. I am at the point where I am able to accept, validate and acknowledge the messages we are receiving from our trading partners and have started to create a data model for the backend storage of said messages. After a lot of consideration and research I am at a loss for the best way to approach this in MS SQL Server 2008 R2.

Currently my idea is to essentially load the data into a data warehouse directly from my integration engine (BizTalk) and foregoing a backing, normalized operational database. I have set up the database for v2X messages according to the v2.7 specs as all versions of HL7 v2 are backward compatible (I can store any previous versions in the same database). My initial design has a table for each segment which will tie back to a header table with a guid I am generating and storing at run time. The biggest issue with this approach is the amount of columns in each table and it's something I have no experience with. For instance the PV1 segment has 569 columns in order to accommodate all possible data. In addition to this I need to make all columns varchar and make them big enough to house any possible customization scenario from our vendors. I am planning on using varchar(1024) to achieve this. A lot of these columns (the majority probably) would be NULL so I would use SPARSE columns. This screams bad design to me but fully normalizing these tables would require a ton of work in both BizTalk and SQL server and I'm not sure what I would gain from doing so. I'm trying to be pragmatic since I have a deadline.

If fully normalized, I would essentially have to create stored procs that would have a ton of parameters OR split these messages to the nth degree to do individual loads into the smaller subtables and make sure they all correlate back to the original guid. I would also want to maintain ACID processing which could get tricky and cause a lot of overhead in BizTalk. I suppose a 3rd option would be to use nHapi to create objects out of the messages I could tie into with Entity Framework but nHapi seems like a dead project and I have no experience with Entity Framework as of right now.

I'm basically at a loss and need help from some industry professionals who have experience with HL7 data modeling. Is it worth the extra effort to fully normalize the tables? Will performance on the SQL side be abysmal if I use these denormalized segment tables with hundreds of columns (most of which will be NULL for each row)? I'm not a DBA so I'm trying to understand the pitfalls of each approach. I've also looked at RIMBAA but the HL7 RIM seems like a foreign language to me as an HL7 newbie and translating v2 messages to the RIM would probably take far longer than I have to complete this project. I'm hoping I'm overthinking this and there is a simpler solution staring me in the face. Hopefully this question isn't too open ended.

I guess the biggest question would be, what fields in the HL7 message is your company actually going to use? It sounds like you're trying to capture every HL7 field in a particular message type(I'm not familiar with CDA RIM unfortunately) - but are you going to actively use ALL of the data being sent and is this data going anywhere besides the database you're creating? If you only really care about a few fields of data in the message, then don't capture it all - just capture what you need to process/send back. — SQLSavant, Mar 04 '14 at 17:33
I agree with you, unfortunately this is a new initiative with absolutely no business rules at the current time. I was tasked with taking all the data we receive in and the business would figure out how to use it later. I'm kind of handcuffed. — Ritley572, Mar 04 '14 at 18:18
Well, that doesn't sound like fun. I'd say your best bet then would be to structure the database based on individual segments. Some segments are reused in multiple different message types/events. This would give it at least a structured feel for a relational database. Additionally, I would translate the data coming in into a set-in-stone specification, and try to work with your vendors/partners to send you data that meets these specifications. Either way, you're looking at a spaghetti database, but at least your tables would be structured and would be less prone to incorrect data. — SQLSavant, Mar 04 '14 at 19:07
That's the approach I am taking, which leads me back to the original question, some of those segments have a huge number of fields (IN2 for example in the 2.5 spec has something like 560 fields). Making a table with that many columns might be problematic, looking for guidance on that, or an alternate approach to alleviate SQL issues. — Ritley572, Mar 04 '14 at 20:37
@Ritley572 I am working on HL7v2 message ...could you please suggest what is the best way to store hl7 v2 message...based on your experience. could you please share your design as answer....else i can open a new question if required and you can add answer I feel that converting it to FHIR and saving would require great deal of work and want to avoid that.I am also asked to not loose any data from hl7 v2 message. — Chris_vr, Jun 29 '20 at 16:05

score 3 · Answer 1 · edited Mar 18 '14 at 16:06

3

HL7 is not a "tight" standard inputs and expected outputs vary depending on the system you are talking to. In this case the adding in a broker such as Mirth, Rhaposdy or BizTalk is a very good idea.

What ever solution you employ make sure you can cope with "non standard" input and output as you will soon find things vary. On the HL7 versions 2X and 3 be aware that very few hospitals have the version 3 most still run 2X.

I have been down the road of working with a database that tried to follow the HL7 structure, it can work however it will take time and effort. Given that you have a tight dead line maybe break out the bits of the data you will need to search on and have fields (e.g. PID segment 3 is the patient id would be useful to have) the rest can go in your varchar. Also if you are not indexing on the column you could use varchar(max).

As for your Guids in the database, this can work fine, but be careful not to cluster any indexes using the Guid as this will fragment your data. Do your research here and if in doubt go for identity columns instead.

I'll recommend the entity framework too, excellent ORM, well worth learning.

So my overall advice. Go for a hybrid for now, breaking out what you need. Expect it to evolve over time breaking out the pieces of HL7 into their own areas as needed. Do write a generic HL7 parser (not too difficult I've done it a couple of times) and keep it flexible. But most of all expect the HL7 to vary in structure don't treat the specification as 100% truth you will get variations.

edited Mar 18 '14 at 16:06

Bensonius

1,416
1
14
37

answered Mar 04 '14 at 14:53

HSG

185
1
7

A lot of the time the variations are down to companies not following the standard during implementation, but I too have found this to be the case. – user692942 Mar 04 '14 at 14:57
The very first trading partner that came on board has customized it, this is why I am trying to build a generic data model. This results in very wide tables (for instance the PV1 segment table I've created has 577 columns) to accomodate customizations to the spec from our trading partners. Also with weakly typed (i.e. all varchar) fields. I'm wondering if having tables that wide will cause issues with fragmentation and/or data corruption in SQL Server 2008 R2. – Ritley572 Mar 04 '14 at 15:43
Depending on the size of your data fragmentation may occur, but if most fields are null I wouldn't expect it to be an issue. Data corruption won't be an issue SQL Server it is very solid. Only thing you may need to keep an eye on is what you need to index by, there is a 900 Byte limit for indexes in MS SQL. – HSG Mar 04 '14 at 16:03
@Ritley572 - The Gentleman's Agreement in integrations is the sending system should adhere to the receiving system's specifications. If you require X data to be in Y field then present the sending vendor with a specifications document detailing such. If they refuse, then you'll want to develop a message translation within BizTalk to tailor the incoming message to match your specification before writing the fields to the database. – SQLSavant Mar 04 '14 at 17:41
@HSG, thanks I'll take indexing into consideration if I end up creating these wide tables for each segment. – Ritley572 Mar 04 '14 at 21:16

score 3 · Answer 2 · answered Apr 07 '14 at 03:41

3

In most cases it's a waste of time to try to create a normalized relational data model to persist HL7 V2 or V3 data. I would recommend just storing entire messages or documents as single XML column values. Then query using SQLXML and/or XQuery. All modern relational databases support this now.

answered Apr 07 '14 at 03:41

Nick Radov

404
2
7

I would emphasize the "store in native XML" aspect of this answer. You business case may justify breaking the message into smaller pieces, but the smallest units would still be stored in XML. You could even supplement the XML with additional columns where you extract specific data (for indexing, etc.) – claytond Oct 02 '14 at 15:55

score 2 · Answer 3 · answered Mar 18 '14 at 16:26

I can only comment on the CDA (and some very limited HL7v2) side of things based on personal experience.

We receive and send CDA documents wrapped in HL7v3 wrappers from external vendors (as well as internal systems -- see below). The wrappers contain the metadata for things like sending/receiving systems/dates and other high-level data. The very limited message metadata is stripped and stored in the message data repository. Inside the wrapper, is the actual CDA, which is then taken and stored as XML datatype in the SQL database.

Using this model we can then search at the metadata level, but also narrow it down based on the CDA using Xpath queries. It makes the database much simpler...I can't even imagine creating columns based on the CDA schema.

As for making clients follow the CDA schema, as a part of the project we've created an implementation guide which clients must follow if they want to have their messages accepted.

Using the implementation guide + schematron + BizTalk and XSD validation, we only accept messages which follow the CDA schema. We then check some data fields using schematron validation and reject if any of those fail. This is relayed to the sender using an HL7v3 message back to them with the specific error message and/or fields that are invalid. This is a point at which a message will be stored in the database.

This is all done in BizTalk/SQL Server. And since the CDA schema is very much pre-defined by the HL7 group, you can make the consumers of this system follow the schema. This is unlike what I've seen with HL7v2 where it seems people just bend the schema as needed.

For the HL7v2 side of things, I'm 99% certain that "we" (as in, "my company") are storing the messages much in the same way. Except since since the HL7v2 schema is so open, we're not validating and just accepting/storing all messages. An HL7v2 parser has been written to parse the HL7v2 using the variations of schemas we know about.

In my project's case, we are sending HL7v2 from our HCIS --> Mirth --> BizTalk which then follows the Implementation guide + CDA Schema along with an XSLT transform to map the HL7v2 to CDA THEN submits it to the OTHER BizTalk CDA Submission service as though it was an external vendor.

That's a ton of reading right now, so please ask questions, as I'd like to talk about it.

Thanks for the response LaMMMy. A few questions, what was the business driver for deciding to map the V2 messages into CDA format? Was it just to have a unified database structure to tie patient visit and clinical information together more easily? I understand the ease of storing the CCD's as XML and using XML queries but in my case the business needs more human readable structure for the data and bypassing the very optimized nature of BizTalk's XML transformation and messaging engines and relying on SQL server seems almost like a waste. Could you clarify the benefits of your approach? — Ritley572, Apr 08 '14 at 16:00
To be honest, the driver was the fact we were trying to strive for a standardized schema. While HL7v2 is technically standard, around here, even amongst our inbound interfaces from THE SAME vendor the schemas vary greatly. I wasn't really involved in the decision to use V3, but it was an attempt to get away from the hybridized V2 schemas that always seem to get settled on. So far, it has worked. The downside is, just as you said it's more complicated on the BizTalk side of things. — Bensonius, Jun 12 '14 at 22:56

score 1 · Answer 4 · answered Mar 14 '14 at 10:59

Modeling on HL7 can be a pain.

I would do the following;

use the standards described in HL7 for staging tables, that way even if you have varchar(1024) and they are null it does not hurt you
create your actual table to be populated from the staging table as per the standards that you have enforced or will enforce.

This means that you have 500+ columns from the message but only 10 or 50 make sense, you will need to model only your 50. Yes, this has a lopside, tomorrow you want to make more meaning then it will increase from 50 to 75, the historical messages will not have information; which is fine but you will need to factor into the design.

score 0 · Accepted Answer · answered Apr 01 '14 at 00:01

I would under no circumstances attempt to model anything using the HL7 v3 RIM. The reason is that this schema is very generic, deferring much of the metadata to the message itself. Are you familiar with an EAV table? The RIM is like that.

On the other hand, HL7 v2 should be a fairly simple basis for a DB schema. You can create tables around segment types, and columns around field names.

I think the problem of pulling in everything kills the project and you should not do it. Typically, HL7 v2 messages carry a small subset of the whole, so it would be an utter waste to build out the whole thing, and it would be very confusing.

Further, the version of v2 you model would impact your schemas dramatically, with later versions, more and more fields become repeating fields, and your join relationships would change.

I recommend that you put a stake in the sand and start with v2.4 which is pretty easy yet still more complicated than most interfaces actually in use. Focus on a few segments and a few fields. MSH and PID first.

Add an EAV table to capture what may come in that you don't yet have in your tables. You can then look at what comes into this table over time and use it to decide what to build next. Your EAV could look like this MSG_ID, SEGMENT, SET_ID, FIELD_NAME, FIELD VALUE. Just store the unparsed HL7 contents of the field value.

Very informative stuff, thank you. I think I will try the EAV approach you've mentioned here as that will help the business identify things they may actually need much easier. — Ritley572, Apr 02 '14 at 03:56
There's nothing particularly wrong with using the HL7 V3 [RIM](http://www.hl7.org/implement/standards/product_brief.cfm?product_id=77) for modelling. It's complex but works quite well for the healthcare domain once you understand how all the pieces fit together. Especially if you're storing HL7 V3 based content such as CDA R2 documents then a RIM-Based Application Architecture (RIMBAA) is a natural fit. However RIM is not a good fit for HL7 V2 messages; you'll have to do a large amount of data transformation to go from V2 to RIM/V3. — Nick Radov, Oct 02 '14 at 19:35
The RIM is too generic in my opinion. While a generic schema (like an EAV table) is flexible, and can have it's place, particularly in prototyping as I suggest above, it does not provide a sound basis for design time validation. In other words, the metadata that should, for the purposes of interoperability, reside in the schema, are instead maintained in the data (content). This means that coding to a schema means little. Validation can only be accomplished, indeed interoperability can only be ensured at run time, empirically, based on examples. — dougcl, Oct 04 '14 at 01:23

HL7 v2X and v3 data modeling

5 Answers5