Regex to process tags with special characters irrespective of order

Question

Python Code

    import re

    def updateRule(rule):
      tokens = rule.split('/')
      return [tokens[0][len('RULE:'):] , tokens[1].replace('$','\\') ]

    def getPX(inputStr,rule):
      reg_match = updateRule(rule)
      match = re.compile(reg_match[0])
      return re.sub(match,reg_match[1],inputStr)


    def main():
      inputStr = "XZ=Rep.com,PX=TE-ST-,PX=Zen,PX=TAG,M=Dana,I=JAR"
      rule= 'RULE:^XZ=[^,]+,(PX=.+),M=Dana,I=JAR$/$1/,DEFAULT'
      print(getPX(inputStr,rule))

    if __name__== "__main__":
      main()

Input Strings/Outputs expected :

Case 1:

    inputStr   =  "XZ=Rep.com,PX=TE-ST-,PX=Zen,PX=TAG,M=Dana,I=JAR"

    Desired output   =  "PX=TE-ST-,PX=Zen,PX=TAG"

Case 2:

    inputStr   = "PX=$#XN,I=JAR,M=Dana,PX=Faber,PX=Module,OU=gif,XZ=dana-fa.com,PX=GAN%"

    Desired output   = "PX=$#XN,PX=Faber,PX=Module,PX=GAN%"

As can be seen, we only need the PX= tags, with their corresponding values, in the final output.

Case 1 gives the desired output, but Case 2 returns values other than the PX= tags.

I don't want to use the findall() method; I would rather change the regex rule in the code so that only PX= tags appear in the final output.

How can we modify the rule below to address this?

    rule= 'RULE:^XZ=[^,]+,(PX=.+),M=Dana,I=JAR$/$1/,DEFAULT'

After a lot of research (capturing groups, non-capturing groups, etc.), this is the new regex rule I have created:

    "[A-Za-z_]+,((?:PX=[A-Za-z$-_ !]+,)+(?:PX=[A-Za-z$-_ !]+,)*).+"

Case 1 works with the following output (with a trailing comma appended):

     PX=TEST,PX=Zen,PX=TAG,

I got it working with special characters as well, but Case 2 still fails because the rule cannot handle PX in an arbitrary position (PX can be at the beginning, in the middle, or at the end). So there are two things to fix in the regex rule: matching PX irrespective of order, and removing the trailing comma. Any suggestions?

You may have two types of rule: this one is regex replacement rule, and the other one can be matching one based on `re.finditer`. What you ask for requires a bit of effort: please share what you have tried to solve the issue. See also [Why is “Can someone help me?” not an actual question?](https://meta.stackoverflow.com/questions/284236) — Wiktor Stribiżew, Mar 10 '20 at 22:40
Note that the expected results for the first input should be `PX=TE-ST-,PX=Zen,PX=TAG` and not `PX=TE-ST-,PX=AS,PX=DCT` because the string does not contain those substrings. — Wiktor Stribiżew, Mar 10 '20 at 22:44
@WiktorStribiżew , thanks . Have made that correction above. — Maazen, Mar 10 '20 at 23:01
You'd better share your attempt at solving the issue. Else, all we can do is share the [post](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) to help you better understand regex capabilities. — Wiktor Stribiżew, Mar 10 '20 at 23:02
@WiktorStribiżew sir , the code in the question is my attempt( have written it to work for case 1 but it fails for Case 2 ) . Another thing I have tried is 'return 'PX='+',PX='.join(re.findall(r'(?i)PX=(.*?),', inputStr))' . But I cannot use findall method approach as the regex rule is also being used in a system property file and the pruning from script and system property should match. So only way is to change the rule= 'RULE:^XZ=[^,]+,(PX=.+),M=Dana,I=JAR$/$1/,DEFAULT' processing. — Maazen, Mar 10 '20 at 23:09
I understand, but that code of yours is a setup. Your problem is getting the results for the second string. You say you need to get the result using `re.sub`. So, what have you tried to achieve that? That is the programming issue, else, there is none. — Wiktor Stribiżew, Mar 10 '20 at 23:11
@WiktorStribiżew , after lot of research finally came up with this rule : "A-Za-z_]+,((PX=[A-Z]+),(PX=[A-Z]+)*).+" . It does cover some cases in Case 1 and Case 2 but misses out some special character groups. Can you please suggest ? — Maazen, Mar 11 '20 at 15:50
Great, could you please add that to the question, and explain the current problem with this regex? — Wiktor Stribiżew, Mar 11 '20 at 15:56
@WiktorStribiżew , sir I get the following error " File "/usr/lib64/python2.7/sre_parse.py", line 800, in expand_template raise error, "unmatched group" sre_constants.error: unmatched group" when running the program ideone.com/aFfG73, in online python compiler : onlinegdb.com/online_python_compiler. – I see that it is related to older versions of python ( prior to 3.5) , I am using Python 2.7.5. Any generic workaround ? . — Maazen, Mar 11 '20 at 22:07
Try https://ideone.com/w4ohLr. It is highly recommended to migrate to Python 3, Python 2 support has come to an end. — Wiktor Stribiżew, Mar 11 '20 at 22:23
@WiktorStribiżew, ideone.com/w4ohLr for input "PX=$#XN,I=JAR,M=Dana,PX=Faber,PX=Module,OU=gif,XZ=dana-fa.com,PX=GAN%" the output is PX=$#XN,PX=Faber,PX=GAN ( It missed PX=Module and PX=GAN% ) . I am currently using python 3 ( but prior to 3.5) get the same error. — Maazen, Mar 11 '20 at 22:40
@WiktorStribiżew , thanks a lot :) it works ! . you are a regex guru. Any good book or tutorial one can read to develop deeper understanding like yours ? — Maazen, Mar 12 '20 at 03:18
I do not know your level of regex knowledge :) so that I can only suggest doing all lessons at [regexone.com](http://regexone.com/), reading through [regular-expressions.info](http://www.regular-expressions.info), [regex SO tag description](http://stackoverflow.com/tags/regex/info) (with many other links to great online resources), and the community SO post called [What does the regex mean](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean). Also, [rexegg.com](http://rexegg.com) is worth having a look at. — Wiktor Stribiżew, Mar 12 '20 at 07:57
@WiktorStribiżew , sorry somehow missed it . In the solution provided https://ideone.com/Ggf0wD . The first inputStr : "XZ=Rep.com,PX=TE-ST-,PX=Zen,PX=TAG,M=Dana,I=JAR" is missing out commas in the output "PX=TE-ST-PX=ZenPX=TAG" — Maazen, Mar 13 '20 at 14:05
Glad it worked for you. Please consider accepting the answer by clicking ✓ on the left (see [How to accept SO answers](http://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work)). — Wiktor Stribiżew, Mar 13 '20 at 22:03
@WiktorStribiżew , Thanks a lot sir. I have asked a new question related to the regex here : https://stackoverflow.com/questions/60837249/regex-rule-processing-without-global-flag .If you can please take a look ? . — Maazen, Mar 24 '20 at 18:37

Wiktor Stribiżew · Accepted Answer · 2020-03-13T14:29:08.550

You may use

    (?s)(PX=[^,]+(?:,(?=.*PX=))?|)(?:(?!PX=).)?

See the regex demo

Replace with `\1` (in your rule, it is written `$1`). Note that the pattern can match an empty string, but that is harmless here: an empty match is replaced with an empty Group 1, so it leaves the string unchanged.
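The empty alternative (`|)`) inside Group 1 also seems relevant to the "unmatched group" error mentioned in the comments: with it, Group 1 always takes part in the match, so `\1` never refers to a group that did not participate. The sketch below compares the pattern with a hypothetical variant that makes the group optional instead (`optional_group` is an illustration only, not part of the answer). It assumes Python 3.5 or later, where a reference to an unmatched group is replaced with an empty string.

```python
import re

# The pattern from the answer: Group 1 is either a PX= tag or an empty string.
with_empty_alt = re.compile(r'(?s)(PX=[^,]+(?:,(?=.*PX=))?|)(?:(?!PX=).)?')

# Hypothetical variant for comparison: the whole group is optional instead.
optional_group = re.compile(r'(?s)(PX=[^,]+(?:,(?=.*PX=))?)?(?:(?!PX=).)?')

text = "I=JAR,PX=Zen"

# Collect what Group 1 holds for every match of each pattern.
groups_with_alt = [m.group(1) for m in with_empty_alt.finditer(text)]
groups_optional = [m.group(1) for m in optional_group.finditer(text)]

# With the empty alternative, Group 1 is always a string (possibly empty).
print(None in groups_with_alt)
# With the optional group, Group 1 is None whenever no PX= tag starts there.
print(None in groups_optional)
```

On Python 3.5+ both patterns give the same `re.sub` result, so the difference should only matter on older interpreters such as the Python 2.7.5 mentioned in the comments.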

Details

- `(?s)` - a `re.DOTALL` inline flag that makes `.` match line break chars, too
- `(PX=[^,]+(?:,(?=.*PX=))?|)` - Group 1, which matches either an empty string (the alternative after `|`) or the following sequence:
  - `PX=[^,]+` - `PX=` and then 1+ chars other than a comma
  - `(?:,(?=.*PX=))?` - an optional comma, matched only if it is followed by any 0+ chars, as many as possible (`.*`), and then `PX=`
- `(?:(?!PX=).)?` - an optional single char (1 or 0 times), matched only if it is not the starting char of a `PX=` sequence.
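To make the breakdown above concrete, here is a minimal sketch that applies the pattern directly with `re.sub` and lists what each non-empty match consumed. The helper names `keep_px` and `trace` and the short sample string are assumptions made for illustration; they do not appear in the original code.

```python
import re

PATTERN = re.compile(r'(?s)(PX=[^,]+(?:,(?=.*PX=))?|)(?:(?!PX=).)?')

def keep_px(text):
    """Replace every match with Group 1, which keeps only the PX= tags."""
    return PATTERN.sub(r'\1', text)

def trace(text):
    """Return (Group 1, whole match) pairs for the non-empty matches."""
    return [(m.group(1), m.group(0)) for m in PATTERN.finditer(text) if m.group(0)]

sample = "A=1,PX=x,B=2,PX=y"
print(keep_px(sample))
for kept, consumed in trace(sample):
    # 'kept' survives the replacement; the rest of 'consumed' is dropped.
    print(repr(kept), repr(consumed))
```

Characters outside a `PX=` tag are consumed one at a time with an empty Group 1, which is why they disappear. A comma after a tag is kept only when another `PX=` follows later in the string, which is what avoids the trailing comma.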

See the Python demo:

    import re

    def updateRule(rule):
      # Split 'RULE:<pattern>/<replacement>/<rest>' on '/' and turn $1 into \1
      tokens = rule.split('/')
      return [tokens[0][len('RULE:'):] , tokens[1].replace('$','\\') ]

    def getPX(inputStr,rule):
      reg_match = updateRule(rule)
      match = re.compile(reg_match[0])
      # Replace every match with Group 1, keeping only the PX= tags
      return re.sub(match,reg_match[1],inputStr)


    inputStr = "XZ=Rep.com,PX=TE-ST-,PX=Zen,PX=TAG,M=Dana,I=JAR"
    rule= 'RULE:(?s)(PX=[^,]+(?:,(?=.*PX=))?|)(?:(?!PX=).)?/$1/,DEFAULT'
    print(getPX(inputStr,rule))
    print(getPX("PX=$#XN,I=JAR,M=Dana,PX=Faber,PX=Module,OU=gif,XZ=dana-fa.com,PX=GAN%",rule))

Output:

    PX=TE-ST-,PX=Zen,PX=TAG
    PX=$#XN,PX=Faber,PX=Module,PX=GAN%
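A few edge cases may be worth checking before relying on this rule. The sketch below reuses the same rule string and the same split-on-`/` parsing as the demo; the function name `apply_rule` and the extra input strings are hypothetical and were chosen only to probe the behaviour.

```python
import re

RULE = 'RULE:(?s)(PX=[^,]+(?:,(?=.*PX=))?|)(?:(?!PX=).)?/$1/,DEFAULT'

def apply_rule(text, rule=RULE):
    # Same parsing as the demo: pattern and replacement are separated by '/'.
    pattern, replacement = rule.split('/')[:2]
    pattern = pattern[len('RULE:'):]             # drop the 'RULE:' prefix
    replacement = replacement.replace('$', '\\')  # $1 -> \1 for Python
    return re.sub(pattern, replacement, text)

# No PX= tag at all: everything is removed.
print(repr(apply_rule("M=Dana,I=JAR")))
# A single PX= tag followed by other tags: no trailing comma is left behind.
print(repr(apply_rule("PX=Zen,M=Dana")))
# Caveat: a key that merely ends in 'PX' (here 'APX') is also treated as a PX= tag.
print(repr(apply_rule("APX=1,PX=2")))
```

The last case suggests that, if keys such as `APX=` can occur in real input, the pattern would need an extra boundary check before `PX=`. Whether that matters depends on the data, so treat this as an observation rather than a required fix.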
