Remove EDIFACT messages from string in Python

Question

A sample EDIFACT message looks like this:

UNB+AHBI:1+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
IFT+3+XYZCOMPANY AVAILABILITY'
ERC+A7V:1:AMD'
IFT+3+NO MORE FLIGHTS'
ODI'
TVL+240493:1000::1220+FRA+JFK+DL+400+C'
PDI++C:3+Y::3+F::1'
!ERC+21198:EC'
APD+74C:0:::6++++++6X'
TVL+240493:1740::2030+JFK+MIA+DL+081+C'
PDI++C:4'
APD+EM2:0:1630::6+++++++DA'
UNT+13+1'
UNZ+1+1'

I need to create a regex that removes EDIFACT messages of this type from strings. It should not remove any of the surrounding text, as that may contain important information. For example, an EDIFACT message can be embedded in text like this:

After discussing with team we found that wrong org segment sent in edifact message. Can you please investigate further why wrong ORG segment is sent. [EDIFACT MESSAGE]
Update information as quickly as possible

Can anybody help create a regex for that?

see https://stackoverflow.com/help/how-to-ask and https://stackoverflow.com/help/mcve ... 1) reduce the sample size, few lines to indicate the problem is enough... 2) add complete expected output for given sample 3) add what you've tried yourself to solve it — Sundeep, Apr 04 '18 at 05:12
Sorry sir, if I was unable to make the question clear to you. @Sundeep — Deepak Aggarwal, Apr 04 '18 at 05:41
In what way are they embedded in strings? Are they always in plaintext and does an EDIFACT message always begin on its own line and contain no whitespace indentation at the beginning? — Simon Shine, Apr 04 '18 at 06:32
@SimonShine Yes sir, they are always in plain text. A message may or may not begin on a new line and can have whitespace indentation at the beginning. — Deepak Aggarwal, Apr 04 '18 at 06:50

Simon Shine · Accepted Answer · 2018-04-04T07:23:23.403

According to a description of the EDIFACT format, the UNA part is optional and the UNB part is mandatory, so either may mark the start of a message. The UNZ part is a mandatory footer. Consider a file that contains

First
UNA:+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
...
UNZ+1+1'
Message
Second
UNB+AHBI:1+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
...
UNZ+1+1'
Message

with ...s comparable to your full example, here's some Python 3 code:

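import re
import sys

# Optional UNA part, mandatory UNB part, then everything up to the end of
# the line that contains UNZ. re.DOTALL lets "." match newlines as well.
regex = re.compile(r'(?:UNA.*?)?UNB.*?UNZ.*?(?:\r\n|\r|\n)', flags=re.DOTALL)

# Read all of stdin, delete every match, and print what remains.
print(regex.sub('', sys.stdin.read()), end='')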

Here I assume that the UNZ part continues until the end of the line, even though that may be inaccurate: it appears to have a fixed format that one could model more precisely.
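A more precise model of the footer could look roughly like the sketch below. It is not part of the tested script above and rests on several assumptions: the default separators are in use ("+" between elements, "'" as segment terminator), the footer has the shape UNZ+<count>+<reference>' seen in the samples, and UNA is followed by exactly six service characters. The names STRICT_EDIFACT and strip_edifact are made up for illustration.

```python
import re

# Sketch only: assumes default EDIFACT separators ('+' and "'").
STRICT_EDIFACT = re.compile(
    r"(?:UNA.{6}\s*)?"     # optional UNA: tag plus six service characters
    r"UNB\+.*?"            # UNB header, lazily up to the first footer
    r"UNZ\+\d+\+[^+']+'"   # footer modelled as UNZ+<count>+<reference>'
    r"(?:\r\n|\r|\n)?",    # swallow one trailing line ending, if there is one
    flags=re.DOTALL,
)


def strip_edifact(text):
    """Return text with every matched EDIFACT interchange removed."""
    return STRICT_EDIFACT.sub("", text)
```

Because the line ending is optional here, a message that sits at the very end of the string without a trailing newline is removed too. Bounding UNA to six characters also keeps an ordinary upper-case word such as "UNABLE" earlier in the text from being treated as the start of a message.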

The run-down of the regex itself:

(?:UNA.*?)? is an optional UNA part; the part that comes after UNA may have any size or format, but should be as small as possible.
UNB.*? is a mandatory UNB part; this marks the beginning of the EDIFACT message and continues for as long as it has to until the first occurrence of UNZ.
UNZ.*?(?:\r\n|\r|\n) is a mandatory UNZ part; it is followed by as few characters as it takes to reach the end of the line. Since this appears to be a rather old format, being conservative about the type of line endings is probably a good thing (\r\n is Windows, and a lot of network protocols honor it for compatibility reasons; \r alone is really old Macs; \n is Unix).
The flags=re.DOTALL part tells Python's regex engine to let "." match newlines as well.
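The same run-down can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows a comment per part. This is a minimal sketch that uses the same regex as above; EDIFACT_RE, remove_edifact and the shortened sample string are illustrative names and data, not something from the question.

```python
import re

# Same pattern as in the answer, spread over several lines with re.VERBOSE.
EDIFACT_RE = re.compile(
    r"""
    (?:UNA.*?)?        # optional UNA part, as short as possible
    UNB.*?             # mandatory UNB part, lazily up to the first UNZ
    UNZ.*?             # mandatory UNZ part ...
    (?:\r\n|\r|\n)     # ... up to and including the first line ending
    """,
    flags=re.DOTALL | re.VERBOSE,
)


def remove_edifact(text):
    """Return text with every matched EDIFACT message removed."""
    return EDIFACT_RE.sub("", text)


# A message that starts in the middle of a line, as described in the comments.
sample = (
    "Can you please investigate further. "
    "UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'\n"
    "UNH+1+PAORES:93:1:IA'\n"
    "UNZ+1+1'\n"
    "Update information as quickly as possible\n"
)
print(remove_edifact(sample), end="")
```

Note that the line ending after UNZ is removed together with the message, so text that preceded the message on the same line is joined directly to the line that follows it.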

Running this script here gives:

First
Message
Second
Message

Thanks, it really helped :) — Deepak Aggarwal, Apr 04 '18 at 07:07
