While running some tests for this answer, I noticed some unexpected behavior. This will remove all occurrences of <tag> after the first:
var input = "<text><text>extra<words><text><words><something>";
Regex.Replace(input, @"(<[^>]+>)(?<=\1.*\1)", "");
// <text>extra<words><something>
But this will not:
Regex.Replace(input, @"(?<=\1.*)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>
Similarly, this will remove all occurrences of <tag> before the last:
Regex.Replace(input, @"(<[^>]+>)(?=.*\1)", "");
// extra<text><words><something>
But this will not:
Regex.Replace(input, @"(?=\1.*\1)(<[^>]+>)", "");
// <text><text>extra<words><text><words><something>
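For anyone who wants to reproduce this, here is a minimal, self-contained sketch of the four calls above. The class name `BackreferenceRepro` and the helper `Strip` are just names I made up for illustration; the input, the patterns, and the expected results in the comments are exactly the ones shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BackreferenceRepro
{
    public const string Input = "<text><text>extra<words><text><words><something>";

    // Hypothetical helper: removes every match of `pattern` from Input.
    public static string Strip(string pattern) => Regex.Replace(Input, pattern, "");

    public static void Main()
    {
        // Group first, lookbehind after it: repeats after the first are removed.
        Console.WriteLine(Strip(@"(<[^>]+>)(?<=\1.*\1)"));
        // <text>extra<words><something>

        // Lookbehind (with the backreference) before the group: input comes back unchanged.
        Console.WriteLine(Strip(@"(?<=\1.*)(<[^>]+>)"));
        // <text><text>extra<words><text><words><something>

        // Group first, lookahead after it: occurrences before the last are removed.
        Console.WriteLine(Strip(@"(<[^>]+>)(?=.*\1)"));
        // extra<text><words><something>

        // Lookahead (with the backreference) before the group: input comes back unchanged.
        Console.WriteLine(Strip(@"(?=\1.*\1)(<[^>]+>)"));
        // <text><text>extra<words><text><words><something>
    }
}
```

In both failing cases the only difference is that `\1` appears in the pattern before the group it refers to.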
So this got me thinking…
In the .NET regular expression engine, does a backreference need to appear after the group it's referencing? Or is there something else going on with these patterns that's causing them not to work?