Recently I stumbled upon this weird regex, which combines a positive and a negative lookahead, and I cannot wrap my head around what it really does. Keep in mind this is Java regex syntax.

(?=((?!\bword1\b|\bword2\b).)+?\s*?)
 ^^  ^^

What do those two nested lookaheads do? Can this be simplified?

azro
Taserface
    A pertinent question is what is it *supposed to do*? You ought to be able to determine that from the context. What I am saying is: don't discount the possibility that the regex is incorrect. If that is the case, doing the same (wrong) thing more efficiently is not the solution. – Stephen C Oct 11 '20 at 12:21

1 Answer

  • . matches any single character, provided that the current position is not the start of "word1" or "word2" standing as a whole word (the alternation \bword1\b|\bword2\b can be simplified to \bword[12]\b). This is the meaning of the negative assertion,
  • +? means at least one such .,
  • but in practice exactly one, because the quantifier is non-greedy and is followed by \s*?, which always matches. Therefore +? can be dropped,
  • \s*? in this assertion is meaningless: it always matches, it is lazy and so matches the empty string, and it is not followed by anything,
  • The positive lookahead assertion (?=...) here means that the position is followed by any character (except for the "w" that starts a whole word "word1" or "word2", as described above).
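
The points above can be checked with a small sketch. The class name LazyQuantifierDemo and the helper firstMatch are made up for this illustration and are not part of the original question. The helper reports the start, the end and the first capture group of the first match, so an empty match shows up as start == end and the lazy +? shows up as a one-character group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyQuantifierDemo {
    // The regex from the question, escaped for a Java string literal.
    public static final String ORIGINAL = "(?=((?!\\bword1\\b|\\bword2\\b).)+?\\s*?)";

    // Hypothetical helper: returns "start,end,group1" for the first match,
    // or null when the regex does not match anywhere in the input.
    public static String firstMatch(String input) {
        Matcher m = Pattern.compile(ORIGINAL).matcher(input);
        if (!m.find()) {
            return null;
        }
        return m.start() + "," + m.end() + "," + m.group(1);
    }

    public static void main(String[] args) {
        // Trailing whitespace is never consumed: the match is empty
        // and group 1 holds a single character.
        System.out.println(firstMatch("xyz   "));
        // Position 0 starts the whole word "word1", so it is rejected
        // and the first match is found one character later.
        System.out.println(firstMatch("word1 x"));
        // Nothing follows the only position of an empty input.
        System.out.println(firstMatch(""));
    }
}
```

With these inputs the first match should be empty in every case, and the capture group should never hold more than one character.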

Any further simplification would remove the capturing group, which may be required by the surrounding code.

So the simplified regex is (?=((?!\bword[12]\b).)). It succeeds before every character of the input (other than a line terminator, which . does not match by default in Java), except at the start of "word1" or "word2" standing as a whole word. The match itself is empty, but the first capture group contains the character that follows.
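
As a quick sanity check of the simplification, the sketch below runs the original and the simplified regex over the same input and collects every match position together with the first capture group. The class name LookaheadDemo, the helper positions and the sample input are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadDemo {
    // The regex from the question and the simplified form from this answer,
    // both escaped for Java string literals.
    public static final String ORIGINAL = "(?=((?!\\bword1\\b|\\bword2\\b).)+?\\s*?)";
    public static final String SIMPLIFIED = "(?=((?!\\bword[12]\\b).))";

    // Hypothetical helper: returns "index:char" for every position
    // at which the regex matches.
    public static List<String> positions(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + ":" + m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // "word1" is a whole word here, "word12" is not.
        String input = "a word1 word12";
        List<String> original = positions(ORIGINAL, input);
        List<String> simplified = positions(SIMPLIFIED, input);
        System.out.println(original);
        System.out.println(simplified);
        System.out.println("same result: " + original.equals(simplified));
    }
}
```

On this input both lists should be identical. Only index 2, where the whole word "word1" begins, should be skipped, while index 8 should still match because "word12" is not "word1" as a whole word.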

https://regex101.com/r/O10c3u/1

Alexander Mashin