8

I know that Java regex does not support variable-length lookbehinds, and that the following should cause an error:

(?<=(not exceeding|no((\\w|\\s)*)more than))xxxx

but when the * is replaced with a bounded quantifier, like so

(?<=(not exceeding|no((\\w|\\s){0,30})more than))xxxx

it still fails. Why is this?

user2559503
  • What's the exact regex that you've tried? The one you mentioned above? – Swapnil Jul 21 '14 at 20:47
  • Lookbehinds need to be zero-width, thus quantifiers are not allowed – Braj Jul 21 '14 at 20:47
  • @Swapnil There are a few more keywords within the lookbehind that I removed for simplicity and the xxxx is a placeholder for a longer expression, but I've tested that part and it is not the problem – user2559503 Jul 21 '14 at 20:50
  • check it [here](http://regex101.com/r/mA2hI5/4) – Braj Jul 21 '14 at 20:52

4 Answers

12

Java Lookbehind is Notoriously Buggy

So you thought Java did not support infinite lookbehind?

But the following pattern will compile!

(?<=\d+)\w+

...though in a Match All it will yield unexpected results (see demo).

On the other hand, you can successfully use this other infinite lookbehind (which I was very surprised to find on this question)

(?<=\\G\\d+,\\d+,\\d+),

to split this string: 0,123,45,6789,4,5,3,4,6000

It will correctly output (see the online demo):

0,123,45
6789,4,5
3,4,6000

This time the results are what you expect.

But if you tweak the regex the slightest bit to obtain pairs instead of triplets, with (?<=\\G\\d+,\\d+),, this time it will not split (see the demo).


The bottom line

Java lookbehind is notoriously buggy. Knowing this, I recommend you don't waste time trying to understand why it does something that is undocumented.

The decisive words that drove me to this conclusion some time ago are those of Jan Goyvaerts, co-author of the Regular Expressions Cookbook and an arch-regex-guru. He has created a terrific regex engine and needs to stay on top of most regex flavors under the sun for his debugging tool RegexBuddy:

Java has a number of bugs in its lookbehind implementation. Some (but not all) of those were fixed in Java 6.

zx81
  • I decided to give the bounty to you for being the voice of reason: "don't waste time trying to understand bug". While it is interesting to know what went wrong, it is not worth spending too much time figuring things out if you already have a working alternative solution. – Pshemo Jul 29 '14 at 11:23
  • @Pshemo Thank you for the bounty. I went down the same road some time ago and asked JG about amazing JS lookbehind behavior, and came to the same conclusion as you... In the end, if it's buggy, what can we do? – zx81 Jul 29 '14 at 12:02
4

That is indeed strange. I can't find an explanation, but the problem seems to disappear if you change (\\w|\\s){0,30} to [\\w\\s]{0,30}

Pattern.compile("(?<=(not exceeding|no([\\w\\s]{0,30})more than))xxxx");
//BTW you don't need ^-----------------------------------------^ these parentheses
//unless you want to use match from this group later
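Here is a minimal sketch of that workaround in use. The class name, the firstMatch helper, and the sample strings are my own assumptions for illustration, not part of the question; the pattern is the one above with the redundant outer parentheses dropped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindWorkaround {
    // Character class [\w\s] instead of the (\w|\s) alternation,
    // so the bounded quantifier {0,30} is accepted inside the lookbehind.
    static final Pattern LIMIT = Pattern.compile(
            "(?<=not exceeding|no[\\w\\s]{0,30}more than)xxxx");

    // Hypothetical helper: start index of the first match, or -1 if none.
    static int firstMatch(String text) {
        Matcher m = LIMIT.matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // Sample inputs are made up; xxxx stands in for the real expression.
        System.out.println(firstMatch("not exceedingxxxx"));
        System.out.println(firstMatch("no more thanxxxx"));
        System.out.println(firstMatch("at leastxxxx"));
    }
}
```

Since the lookbehind is zero-width, the reported index is where xxxx itself starts, not where the keyword starts.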
Pshemo
  • As per the OP, both fail. I think both are variable length. – Braj Jul 21 '14 at 20:50
  • That works for me too. Strange. I've used parentheses in this exact context many times before and it's never failed, but this one seems to need the brackets – user2559503 Jul 21 '14 at 20:54
  • @user2559503 Look-behind in Java is indeed a strange creature. Sometimes it can't figure out the max length even if it is obvious (like in this case) and sometimes it lets you use an unlimited regex (like in [this answer](http://stackoverflow.com/questions/16485687/extracting-pairs-of-words-using-string-split/16486373#16486373)). – Pshemo Jul 21 '14 at 20:58
  • "unless you want to match from this group later" If the lookbehind is a group anyway, you definitely don't need those extra parens. Haven't tested that... – aliteralmind Jul 21 '14 at 21:03
  • @aliteralmind But a look-behind is not counted as a capturing group, so we can't get the part it matched without wrapping the regex inside it in parentheses. – Pshemo Jul 21 '14 at 21:06
  • @user2559503 You probably shouldn't accept my answer because it doesn't explain your problem; it just provides a way around it (and I am not sure why it even works while the code from your question doesn't). I would leave this question without an acceptance mark until someone actually provides a confirmed logical explanation of your problem. Give it a few days maybe; I can try to add a bounty to this question later so maybe someone will figure it out. – Pshemo Jul 21 '14 at 23:13
3

java regex does not support varying length look-behinds

That is not entirely true. Java supports variable-length lookbehinds as long as the length is bounded: for example, (?<=.{0,1000}) is allowed, as are (?<=ab?)c and (?<=abc|defgh).

But if there is no limit at all, Java doesn't support it.

So, what is not obvious to the Java regex engine in a lookbehind subpattern:

a {m,n} quantifier applied to a non-fixed-length subpattern:

(?:abc){0,1} is allowed

(?:ab?)?     is allowed
(?:ab|de)    is allowed
(?:ab|de)?   is allowed

(?:ab?){0,1}   is not allowed
(?:ab|de){1}   is not allowed
(?:ab|de){0,1} is not allowed # in my opinion, it is because of the alternation.
                              # When an alternation is detected, the analysis
                              # stops immediately

To obtain this error message in this particular kind of case, two criteria must be met:

  • a potentially variable length subpattern (ie: that contains a quantifier, an alternation or a backreference)

  • and a {m,n} type quantifier.

None of these cases is evident to the user, since the choice looks arbitrary. However, I think the real reason is to limit the time the regex engine spends pre-analyzing the pattern.
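To check the cases listed above against whichever JDK you are running, a small probe like the following can help. This is only a sketch: the class name and the compiles helper are hypothetical, each subpattern is wrapped as (?<=...)c purely for testing, and the allowed/not allowed result for the {m,n} cases may differ between Java versions, so no output is shown.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class LookbehindProbe {
    // Hypothetical helper: true if this JDK accepts the pattern,
    // false if Pattern.compile rejects it.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Lookbehind bodies taken from the lists in this answer.
        String[] bodies = {
            ".{0,1000}", "ab?", "abc|defgh",
            "(?:abc){0,1}", "(?:ab?)?", "(?:ab|de)", "(?:ab|de)?",
            "(?:ab?){0,1}", "(?:ab|de){1}", "(?:ab|de){0,1}"
        };
        for (String body : bodies) {
            String regex = "(?<=" + body + ")c";
            System.out.println(regex + " -> "
                    + (compiles(regex) ? "allowed" : "not allowed"));
        }
    }
}
```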

Casimir et Hippolyte
  • This was the right answer for me. I just changed a '+' in the lookbehind to something like '{1,100}' and things worked as expected. – Randall T. Apr 18 '19 at 19:26
0

Below are some test cases (I removed the redundant parens, as mentioned by @Pshemo). It only fails where the lookbehind contains a sub-alternation. The error is

Look-behind group does not have an obvious maximum length near index 45

"Obvious" being the keyword here.

import java.util.regex.Pattern;

public class Test {
    public static void main(String[] ignored) {
        test("(?<=not exceeding|no)xxxx");
        test("(?<=not exceeding|NOT EXCEEDING)xxxx");
        test("(?<=not exceeding|x{13})xxxx");
        test("(?<=not exceeding|x{12}x)xxxx");
        test("(?<=not exceeding|(x|y){12}x)xxxx");
        test("(?<=not exceeding|no(\\w|\\s){2,30}more than)xxxx");
        test("(?<=not exceeding|no(\\w|\\s){0,2}more than)xxxx");
        test("(?<=not exceeding|no(\\w|\\s){2}more than)xxxx");
    }

    // Tries to compile the regex and prints either "Success" or the exception.
    private static void test(String regex) {
        System.out.print("testing \"" + regex + "\"...");
        try {
            Pattern.compile(regex);
            System.out.println("Success");
        } catch (Exception x) {
            System.out.println(x);
        }
    }
}

Output:

testing "(?<=not exceeding|no)xxxx"...Success
testing "(?<=not exceeding|NOT EXCEEDING)xxxx"...Success
testing "(?<=not exceeding|x{13})xxxx"...Success
testing "(?<=not exceeding|x{12}x)xxxx"...Success
testing "(?<=not exceeding|(x|y){12}x)xxxx"...java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 27
(?<=not exceeding|(x|y){12}x)xxxx
                           ^
testing "(?<=not exceeding|no(\w|\s){2,30}more than)xxxx"...java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 41
(?<=not exceeding|no(\w|\s){2,30}more than)xxxx
                                         ^
testing "(?<=not exceeding|no(\w|\s){0,2}more than)xxxx"...java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 40
(?<=not exceeding|no(\w|\s){0,2}more than)xxxx
                                        ^
testing "(?<=not exceeding|no(\w|\s){2}more than)xxxx"...java.util.regex.PatternSyntaxException: Look-behind group does not have an obvious maximum length near index 38
(?<=not exceeding|no(\w|\s){2}more than)xxxx
                                      ^
aliteralmind