Why does this Python regex fail?

Question

import sys
import os
import re
import numpy as np
#Tags to remove, sample line:  1:one:2:two:....:122:twentytwo:....:194:ninetyfour:....
r122 = re.compile(':122:(.):')
r194  = re.compile(':194:(.):')

if len(sys.argv) < 2 :
    sys.exit('Usage: python %s <file2filter>' % sys.argv[0])
if not os.path.exists(sys.argv[1]):
    sys.exit('ERROR: file %s not found!' % sys.argv[1])
with open (sys.argv[1]) as f:
    for line in f:
        line = re.sub(r':122:(.):', '', str(line))
        line = re.sub(r':194:(.):', '', str(line))
        print(line,end=" ")

In

1:one:2:two:....:122:twentytwo:....:194:ninetyfour:....

Out

1:one:2:two:....:122:twentytwo:....:194:ninetyfour:....

The tags 122 and 194 are not removed. What am I missing here?

I want to remove :122:twentytwo: and :194:ninetyfour: from the lines in the file — user1409254, Apr 30 '20 at 17:31
So, you need to replace `(.)` with `[^:]+` in your patterns. And you need just one, `with open (sys.argv[1], 'r') as f:` and then `for line in f: print(re.sub(r':1(?:22|94):[^:]+:', '', line))` — Wiktor Stribiżew, Apr 30 '20 at 17:33
`(.)` only matches one character between `:122:` and `:`, but `twentytwo` is longer than 1 character. — Barmar, Apr 30 '20 at 17:33
Why did you put the `.` in a capture group? You never reference it in the replacement string. — Barmar, Apr 30 '20 at 17:33
Are you sure you want to remove the `:` at the beginning and end? After the replacement you won't have `:` between the fields that were around the removed field. — Barmar, Apr 30 '20 at 17:35
I want to remove the leading : How to make this explicit and not use the fact that the first char is 1 ? sub(r':1(?:22|94): that way I can add more tags say 945 instead of 122 — user1409254, Apr 30 '20 at 17:51
Do you mean you need `re.sub(r':((?:122|194):[^:]+:)', r'\1', line)`? See https://regex101.com/r/FFECwg/1. Or just what I posted before, ``re.sub(r':(?:122|194):[^:]+:', '', line)``? — Wiktor Stribiżew, Apr 30 '20 at 17:53
Perfect. Thank you very much Wiktor, Thanks also for the regex101 link. — user1409254, Apr 30 '20 at 18:00
Please review my answer below, let me know if it is all you need. — Wiktor Stribiżew, Apr 30 '20 at 18:03
Beyond the responses above, I suggest you google for a language-specific website that will run your regex in real time and let you try things many times faster than cycling through code. Additionally, these websites typically show you what is matched in ways your code doesn't, so you would have seen immediately that you were only matching one character. — Frank Merrow, Apr 30 '20 at 21:37

score 1 · Accepted Answer · answered Apr 30 '20 at 18:01

Your patterns contain (.) that matches and captures any single char other than a line break char. What you want is to match any chars other than :, so you need to use [^:]+.
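A minimal sketch of the difference, assuming a shortened sample line made up for illustration (the variable names `single` and `multi` are not from the question):

```python
import re

# Hypothetical sample line, shortened from the one in the question
line = "1:one:2:two:122:twentytwo:194:ninetyfour:3:three"

# (.) allows exactly one character between the colons, so this finds nothing
single = re.search(r':122:(.):', line)
print(single)         # None

# [^:]+ allows one or more characters that are not a colon
multi = re.search(r':122:[^:]+:', line)
print(multi.group())  # :122:twentytwo:
```

Because `re.sub` leaves the string unchanged when nothing matches, the original script printed every line as it was read.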

You do not need to compile separate regex objects if only a part of your regex changes. You may build your regex dynamically and compile it once before reading the file. E.g., if you have the values 122, 194 and 945 to use in the :...:[^:]+: pattern in place of ..., then you may use

vals = ["122", "194", "945"]
r = re.compile(r':(?:{}):[^:]+:'.format("|".join(vals)))
# Or, using f-strings
# r = re.compile(rf':(?:{"|".join(vals)}):[^:]+:')

The resulting regex is :(?:122|194|945):[^:]+: and it consists of:

: - a colon
(?:122|194|945) - a non-capturing group matching 122, 194 or 945
: - a colon
[^:]+ - 1+ chars other than a :
: - a colon
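To see what the pattern does to a line, here is a small sketch using a made-up sample line (`line`, `removed` and `kept_sep` are illustrative names). It also shows the point raised in the comments: removing both colons joins the neighbouring fields, so replacing with a single colon may be what you want instead.

```python
import re

vals = ["122", "194", "945"]
r = re.compile(r':(?:{}):[^:]+:'.format("|".join(vals)))

# Hypothetical sample line for illustration
line = "1:one:2:two:122:twentytwo:3:three:194:ninetyfour:4:four"

removed = r.sub('', line)    # drops the tag, its value and both colons
kept_sep = r.sub(':', line)  # same, but leaves one colon as a separator

print(r.pattern)   # :(?:122|194|945):[^:]+:
print(removed)     # 1:one:2:two3:three4:four
print(kept_sep)    # 1:one:2:two:3:three:4:four
```

Note that each match consumes its trailing colon, so if two tagged fields sit directly next to each other (sharing one colon), only the first of them is matched.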

Then use

with open(sys.argv[1], 'r') as f:
    for line in f:
        # line already ends with a newline, so suppress print's own
        print(r.sub('', line), end='')
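Put together, the whole script could look roughly like the sketch below. The helper names `build_pattern`, `filter_lines` and `main` are my own, not part of the question, and `re.escape` is an extra precaution in case a tag ever contains a regex metacharacter.

```python
import re
import sys

TAGS = ["122", "194", "945"]  # tag numbers to drop; adjust as needed


def build_pattern(tags):
    """Compile one regex that matches :<tag>:<value>: for any listed tag."""
    # re.escape is a precaution for tags containing regex metacharacters
    alternation = "|".join(re.escape(t) for t in tags)
    return re.compile(rf":(?:{alternation}):[^:]+:")


def filter_lines(lines, pattern):
    """Yield each line with the matching tag/value pairs removed."""
    for line in lines:
        yield pattern.sub("", line)


def main(argv):
    if len(argv) < 2:
        sys.exit("Usage: python %s <file2filter>" % argv[0])
    pattern = build_pattern(TAGS)  # compiled once, before reading the file
    with open(argv[1]) as f:
        for line in filter_lines(f, pattern):
            print(line, end="")    # lines keep their own newline
```

Run it by calling `main(sys.argv)` at the bottom of the script. Keeping the substitution in `filter_lines` means it can be tried on a plain list of strings without touching a file.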
