
I apologize for the poorly worded question.

I have a large number of strings like:

"ODLS_ND33283633__PS1185"

The letters up to the first "_" are a header, and the remainder (ND33283633__PS1185) is a unique ID.

I wrote a regex in Python to remove everything up to and including the first "_", hoping to get

"ND33283633__PS1185"

as the end result.

I figured something like:

.*_? or .+?_

would do the trick, but that was not the case.
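For reference, here is a minimal sketch of what those two attempts do. It assumes they were applied with `re.sub` and an empty replacement string; the question does not show the exact call, so that part is an assumption, and the variable names are purely illustrative.

```python
import re

s = "ODLS_ND33283633__PS1185"

# Greedy `.*` consumes the whole string; the optional `_?` then matches nothing,
# so the entire input is replaced. (Assumed call shape: re.sub with "" as replacement.)
greedy_result = re.sub(r".*_?", "", s)

# Lazy `.+?_` stops at the nearest "_", but re.sub replaces every
# non-overlapping match, so it matches "ODLS_" and then "ND33283633_".
lazy_result = re.sub(r".+?_", "", s)

print(repr(greedy_result))  # ''
print(repr(lazy_result))    # '_PS1185'
```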

I kept trying various regexes without success, and finally went online and found another person's answer, which I adapted as:

^[^_]+_

This gave me my desired result, but now I have questions I can't figure out the answers to:

I found that removing the "^" at the front and writing it as:

[^_]+_

caused the regex to remove everything up to the second "_" so the resulting string was:

"_PS1185"

I understand that "^" anchors the match to the beginning of the line, but I would like to know why leaving it out removes everything up to the second "_".

My understanding is that [^_]+ matches one or more characters that are NOT "_", so why would including the "^" at the beginning cause it to stop at the first "_", while excluding it causes it to stop at the second?
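One way to see what is happening is to list the matches themselves rather than only the substituted result. The sketch below assumes the substitution was done with `re.sub` (as the comments suggest) and uses `re.findall` to show which substrings each pattern matches; the variable names are illustrative.

```python
import re

s = "ODLS_ND33283633__PS1185"

# Without the anchor, the pattern may match anywhere, and re.sub replaces
# every non-overlapping match it finds while scanning left to right.
unanchored_matches = re.findall(r"[^_]+_", s)
unanchored_result = re.sub(r"[^_]+_", "", s)

# With "^", a match can only begin at the start of the string,
# so there is at most one match to replace.
anchored_matches = re.findall(r"^[^_]+_", s)
anchored_result = re.sub(r"^[^_]+_", "", s)

print(unanchored_matches, unanchored_result)  # ['ODLS_', 'ND33283633_'] _PS1185
print(anchored_matches, anchored_result)      # ['ODLS_'] ND33283633__PS1185
```

The second "_" of the double underscore survives in the unanchored case because `[^_]+` needs at least one non-underscore character before it, and there is none at that position.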

Another thing, when I replaced the "+" symbol with a "*":

[^_]*_

I expected the same result but instead got:

PS1185

I thought that * matches 0 or more, while + matches 1 or more, so they're effectively the same except + is supposed to be more 'strict'. However, seeing these results makes me feel like I don't fully understand how regex is behaving. Is there anyone here that can please explain what is actually going on?
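The difference between `+` and `*` shows up at the double underscore. As a hedged illustration (again assuming `re.sub` with an empty replacement, with illustrative variable names), listing the matches makes the extra one visible:

```python
import re

s = "ODLS_ND33283633__PS1185"

# `+` requires at least one non-underscore before the "_",
# so the second "_" of "__" cannot be matched on its own.
plus_matches = re.findall(r"[^_]+_", s)

# `*` allows zero non-underscore characters, so a lone "_" is itself a match.
star_matches = re.findall(r"[^_]*_", s)
star_result = re.sub(r"[^_]*_", "", s)

print(plus_matches)  # ['ODLS_', 'ND33283633_']
print(star_matches)  # ['ODLS_', 'ND33283633_', '_']
print(star_result)   # PS1185
```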

Ben C Wang
  • `.+?` is **not** equal to `.*`... – Willem Van Onsem Oct 12 '17 at 17:53
  • When you remove `^` then `re.sub` is making 2 substitutions first `ODLS_` and then `ND33283633_` – anubhava Oct 12 '17 at 17:53
  • @anubhava Yes, but why is the '^' being used to tell the regex to make 1 replacement versus 2? I thought '^' is only an identifier for the start of a line, not a condition for the number of replacements to be made. – Ben C Wang Oct 12 '17 at 17:57
  • `^` enforces that the regex matches `[^_]*_` only at the start, not anywhere else in the input. – anubhava Oct 12 '17 at 17:58
  • I see, so "^" is used as a 'match only the first case' condition in regex. Thanks for the explanation. – Ben C Wang Oct 12 '17 at 18:05
  • in Perl I'd do something like this: `my $id = s/.*?_//;`. PCRE language-agnostic example: https://regex101.com/r/9w1R2Y/1/ – a1111exe Oct 12 '17 at 19:13
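
Following up on the comments above: `^` restricts where a match may start; it does not limit the number of replacements as such. A minimal sketch of other ways to get a single removal in Python, assuming the goal is only to drop the header up to the first "_" (variable names are illustrative):

```python
import re

s = "ODLS_ND33283633__PS1185"

# re.sub accepts a `count` argument that caps the number of replacements,
# so the unanchored pattern can be limited to its first match.
via_count = re.sub(r"[^_]+_", "", s, count=1)

# For a fixed single-character delimiter, plain string methods avoid regex entirely.
via_split = s.split("_", 1)[1]
via_partition = s.partition("_")[2]

print(via_count)      # ND33283633__PS1185
print(via_split)      # ND33283633__PS1185
print(via_partition)  # ND33283633__PS1185
```

Note that `s.split("_", 1)[1]` raises `IndexError` if the string contains no "_", whereas `partition` returns an empty string in that case.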
