While running some tests for this answer, I noticed the following unexpected behavior. This will remove all occurrences of <tag> after the first:

var input = "<text><text>extra<words><text><words><something>";
Regex.Replace(input, @"(<[^>]+>)(?<=\1.*\1)", "");
// <text>extra<words><something>

But this will not:

Regex.Replace(input, @"(?<=\1.*)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>

Similarly, this will remove all occurrences of <tag> before the last:

Regex.Replace(input, @"(<[^>]+>)(?=.*\1)", "");
// extra<text><words><something>

But this will not:

Regex.Replace(input, @"(?=\1.*\1)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>
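For anyone who wants to reproduce this, the four calls above can be put into one small console program. This is only a sketch: the class name `BackrefOrderDemo` and the helper `Strip` are made up for illustration, and the results in the comments are the ones reported above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackrefOrderDemo
{
    public const string Input = "<text><text>extra<words><text><words><something>";

    // Hypothetical helper: removes every match of the pattern from Input.
    public static string Strip(string pattern) => Regex.Replace(Input, pattern, "");

    public static void Main()
    {
        // Group first, backreference afterwards: duplicates are removed.
        Console.WriteLine(Strip(@"(<[^>]+>)(?<=\1.*\1)")); // <text>extra<words><something>
        Console.WriteLine(Strip(@"(<[^>]+>)(?=.*\1)"));    // extra<text><words><something>

        // Backreference first: nothing matches, so the input comes back unchanged.
        Console.WriteLine(Strip(@"(?<=\1.*)(<[^>]+>)"));
        Console.WriteLine(Strip(@"(?=\1.*\1)(<[^>]+>)"));
    }
}
```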

So this got me thinking…

In the .NET regular expression engine, does a backreference need to appear after the group it's referencing? Or is there something else going on with these patterns that's causing them not to work?

p.s.w.g
  • Well logically speaking, you need to capture something first and then use it as backreference otherwise recursive regexes [like this one](http://stackoverflow.com/questions/18262551/can-the-for-loop-be-eliminated-from-this-piece-of-php-code/18262967#18262967) would fail :) PS: it's not only in .net, I think it's the case in all flavors. See a demo in [php pcre](http://regex101.com/r/zR9jR2) – HamZa Aug 20 '13 at 20:10
  • @HamZa Thanks, that's a good point. I would've thought lookaround assertions *might* be different because (as I understand it) they must be evaluated after the matched section of the string. In other words, it must find `<tag>` before it can check the assertion, but I don't know too much about the internals of regex, so I might be wrong about that. Also, I suppose in most cases it would be *more* surprising if the engine's behavior depended on the order in which it was evaluated rather than the order it appeared in the pattern. – p.s.w.g Aug 20 '13 at 20:19

1 Answer

Your question got me thinking too, so I ran a few tests with RegexBuddy. To my surprise, the second regex, (?<=\1.*)(<[^>]+>), which you said didn't work, actually did work there, while the others behaved exactly as you described. I then tried the same expression in C# code, and it failed to match, just as it did for you.

This confused me until I noticed that my RegexBuddy version dates back to 2008, so my guess is that something has changed in how the .NET engine works since then. It did bring to light a behavior I thought was reasonable: it seems that before 2008, lookbehinds were evaluated after the rest of the expression had matched. That seemed acceptable to me for lookbehinds, since you need to match something before you can look behind it.

Nevertheless, engines these days seem to evaluate a lookaround as soon as they encounter it. I was able to confirm this with the following expression, which is the reverse of your situation:

(?<=(\w))\1

As you can see, I captured a word character inside the lookbehind and referenced it outside of it. I tested this on the string hello, and it matched at the second l character as expected, which shows that the lookbehind was executed before the engine attempted to match the rest of the expression.
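The test on hello can be checked with a few lines of C#. This is a minimal sketch, and the names `LookbehindCaptureDemo` and `FirstMatchIndex` are hypothetical, chosen only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookbehindCaptureDemo
{
    // Hypothetical helper: index of the first match, or -1 if there is none.
    public static int FirstMatchIndex(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Index : -1;
    }

    public static void Main()
    {
        // The lookbehind captures the previous character into group 1,
        // then \1 has to match that same character at the current position.
        Console.WriteLine(FirstMatchIndex("hello", @"(?<=(\w))\1")); // 3, the second 'l'
    }
}
```

If the lookbehind were only evaluated after the rest of the pattern, group 1 would still be empty when `\1` is tried, and there would be no match at all.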

Conclusion: Yes, a backreference needs to appear after the group it references; otherwise it simply won't match anything.

Ibrahim Najjar
  • The error message is telling you `(?<=\1)\w` is invalid syntax because it contains a reference to a capturing group that doesn't exist. `(?<=\1)(\w)` is valid syntax, but it will never succeed because it's trying to match the contents of the group before that group can participate in the match. – Alan Moore Aug 21 '13 at 12:12
  • @AlanMoore You are of course correct, mistake corrected. Thank you. – Ibrahim Najjar Aug 21 '13 at 13:38
  • Thanks for taking the time to respond. Actually, I get no exception if it appears before the group; it just doesn't match anything. I had considered that a backreference might match an empty string if the group it references hadn't been captured yet, but apparently it doesn't match anything at all, e.g. `Regex.Matches("hello", @"\1()")` matches nothing, but `Regex.Matches("hello", @"(?!\1)()")` matches a 0-length string around every character. – p.s.w.g Aug 21 '13 at 16:06
  • @p.s.w.g I am sorry about the exception statement; Alan mentioned this in his comment, but I must have forgotten to fix the conclusion. You are correct that a backreference doesn't match anything if the group it references hasn't been matched yet. Nevertheless, regarding the main point, which is whether a lookaround is evaluated before or after the rest of the expression, it is safe to say that it is evaluated as soon as it is encountered. – Ibrahim Najjar Aug 21 '13 at 22:59
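
The two calls from p.s.w.g's comment can be compared by counting matches. This is a rough sketch with the default regex options; `UnsetGroupDemo` and `CountMatches` are hypothetical names used only here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class UnsetGroupDemo
{
    // Hypothetical helper: number of matches of the pattern in the input.
    public static int CountMatches(string input, string pattern) =>
        Regex.Matches(input, pattern).Count;

    public static void Main()
    {
        // \1 is tried before group 1 has captured anything, so it fails everywhere.
        Console.WriteLine(CountMatches("hello", @"\1()"));     // 0

        // The negative lookahead succeeds precisely because \1 fails,
        // leaving an empty match at each of the positions 0 through 5.
        Console.WriteLine(CountMatches("hello", @"(?!\1)()")); // 6
    }
}
```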