6

While answering another question, I wrote a regex to match all whitespace up to and including at most one newline. I did this using negative lookbehind for the \R linebreak matcher:

((?<!\R)\s)*

Afterwards I thought about it some more and wondered: what if there is a \r\n? Surely it will grab the first linebreak-ish character, \r, and then I will be stuck with a spurious \n on the front of my next string, right?

So I went back to test (and presumably fix) it. However, when I tested the pattern, it matched the entire \r\n. It does not match only the \r and leave the \n behind, as one might expect.

"\r\n".matches("((?<!\\R)\\s)*"); // true, expected false

However, when I use the "equivalent" pattern mentioned in the documentation for \R, it returns false. So is that a bug with Java, or is there a valid reason why it matches?

Patrick Parker

2 Answers

5

The construct \R is a macro that wraps its subexpressions in an atomic group: (?> parts ).

That's why it won't break them apart.
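To see the atomic-group behavior in isolation, here is a minimal sketch that does not use \R at all. The class name, method names, and the reduced alternation (only `\r\n|\r`) are illustrative assumptions, not the engine's actual expansion.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative only.
final class AtomicGroupDemo {
    // Plain group: after "\r\n" fails to leave a "\n" for the tail,
    // the engine backtracks and retries with the shorter "\r" alternative.
    static final Pattern PLAIN = Pattern.compile("(?:\\r\\n|\\r)\\n");

    // Atomic group: once "\r\n" has matched, the "\r" alternative is
    // never retried, even if the rest of the pattern then fails.
    static final Pattern ATOMIC = Pattern.compile("(?>\\r\\n|\\r)\\n");

    static boolean plainMatches(String s) {
        return PLAIN.matcher(s).matches();
    }

    static boolean atomicMatches(String s) {
        return ATOMIC.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(plainMatches("\r\n"));    // true
        System.out.println(atomicMatches("\r\n"));   // false
        System.out.println(atomicMatches("\r\n\n")); // true
    }
}
```

The second call fails because the atomic group has already taken both characters and will not give the `\n` back.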

A note: using \R inside a lookbehind is only OK if Java accepts fixed-length alternations in a lookbehind; if the engine doesn't, this would throw an exception.

muru
  • So in your opinion, should the [documentation](https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html) say `is equivalent to (?>\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])`? Or is its being a macro self-evident? – Patrick Parker Feb 26 '17 at 21:58
  • @PatrickParker - I think it's actually more like `(?>\r\n|\n|\r|[stuff])`. But yeah, that is substituted into the regex string, then reparsed as exactly that. In other words, it parses `\R`, does the substitution, then parses the substitute. That means it is hardcoded into the engine. –  Feb 26 '17 at 22:02
  • You have mostly answered my question but it is still a little unclear to me why an atomic group would prevent the negative lookbehind from seeing the `\r`. In my mind it is advancing one `\s` character at a time and looking back. If you could elaborate on that at all that would be great. – Patrick Parker Feb 26 '17 at 22:14
  • @PatrickParker - _Atomic Groups_ are atomic. That means their subexpressions can't be broken up, i.e. backtracked into. Once the group finds a match, it is done; it returns a true/false condition. Since `\r\n` is always first among the alternatives, it will always match (if it finds it) and then return true. With _Assertions_ it's the same thing. Essentially, it never gets past the `\r` in `\r\n`. –  Feb 26 '17 at 22:19
  • From Casimir's post, it looks like Java doesn't support `\R`. But I'm going to leave my post up as a historical record of what the `\R` construct really is. If Java did support it, you should have matched the first `\r`, because there is nothing before it, and failed on the `\n`, since there is a `\r` before it that is matched by `\R`. It didn't, so something is up. More than likely it is a bug. –  Feb 26 '17 at 22:35
  • @sln: No, my fault: `\R` has been available since Java 8 – Casimir et Hippolyte Feb 26 '17 at 22:46
  • Now I think I understand; a lookbehind is not only able to look backwards, but can include and even jump over the current position. – Patrick Parker Feb 26 '17 at 22:51
  • I remember a post on SO about a bug in the parser, which in some situations fails to see that the subpattern in the lookbehind is variable-length (and not limited). Perhaps this is another bug. – Casimir et Hippolyte Feb 26 '17 at 22:57
  • @PatrickParker - Indeed: `\n(? –  Feb 26 '17 at 23:03
3

Realization #1. The documentation is wrong

Source: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

Here it says:

Linebreak matcher

...is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

However, when we try using the "equivalent" pattern, it returns false:

String _R_ = "\\R";
System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true

// using "equivalent" pattern
_R_ = "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";
System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // false

// now make it atomic, as per sln's answer
_R_ = "(?>"+_R_+")";
System.out.println("\r\n".matches("((?<!"+_R_+")\\s)*")); // true

So the Javadoc should really say:

...is equivalent to (?>\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029])

Update March 9, 2017 per Sherman at Oracle JDK-8176029:

"api doc is NOT wrong, the implementation is wrong (which fails to backtracking "0x0d+next.match()" when "0x0d+0x0a + next.match()" fails)"


Realization #2. Lookbehinds don't only look backwards

Despite the name, a lookbehind is not only able to look backwards, but can include and even jump over the current position.

Consider the following example (from rexegg.com):

"_12_".replaceAll("(?<=_(?=\\d{2}_))\\d+", "##"); // _##_

"This is interesting for several reasons. First, we have a lookahead within a lookbehind, and even though we were supposed to look backwards, this lookahead jumps over the current position by matching the two digits and the trailing underscore. That's acrobatic."
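As a small runnable check of the quoted example, the sketch below wraps the same regex in a helper and adds one contrasting input. The class and method names are hypothetical, and the second input is my own addition rather than part of the rexegg example.

```java
// Hypothetical demo class; the regex itself is the one quoted above.
final class LookbehindDemo {
    // The lookbehind requires "_" immediately before the match. The nested
    // lookahead then reads forward from that same spot, past the current
    // position, and requires exactly two digits followed by "_".
    static final String REGEX = "(?<=_(?=\\d{2}_))\\d+";

    static String mask(String input) {
        return input.replaceAll(REGEX, "##");
    }

    public static void main(String[] args) {
        System.out.println(mask("_12_"));  // _##_
        System.out.println(mask("_123_")); // _123_
    }
}
```

The second input is left unchanged because the nested lookahead sees three digits before the closing underscore, so the lookbehind fails at every position.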

What this means for our example of \R is that even though our current position may be \n, that will not stop the lookbehind from recognizing that its \r is followed by \n, then binding the two together as an atomic group, and consequently refusing to recognize the \r part behind the current position as a separate match.

Note: for simplicity's sake I have used terms such as "our current position is \n"; however, this is not an exact representation of what occurs internally.
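To make the effect on the matched span visible, here is a hedged sketch that reports how many characters the whitespace loop consumes. It uses only the spelled-out pattern from the Javadoc and its atomic variant. It deliberately avoids \R itself, since JDK-8176029 treats that behavior as an implementation bug and the result may differ between Java releases. The class name, the `consumed` helper, and the `DOCUMENTED` constant are hypothetical names introduced for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative only.
final class LookbehindSpanDemo {
    // The expansion given in the Pattern Javadoc for the linebreak matcher.
    static final String DOCUMENTED =
        "\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029]";

    // Returns how many leading characters of input are consumed by
    // ((?<!linebreak)\s)* when matching from the start of the input.
    static int consumed(String linebreak, String input) {
        Matcher m = Pattern.compile("((?<!" + linebreak + ")\\s)*").matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // Plain alternation: the lookbehind can fall back to a lone CR,
        // so the loop stops after the CR and leaves the LF behind.
        System.out.println(consumed(DOCUMENTED, "\r\n"));               // 1
        // Atomic variant: CR LF is bound together and overshoots the
        // current position, so the lookbehind finds nothing and LF is taken.
        System.out.println(consumed("(?>" + DOCUMENTED + ")", "\r\n")); // 2
        // Both variants stop after the first of two separate newlines.
        System.out.println(consumed(DOCUMENTED, " \n\n"));              // 2
    }
}
```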

Patrick Parker
  • 4,381
  • 3
  • 15
  • 43
  • FYI: regarding the documentation issue, Java bug report [JDK-8176029](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8176029) has been filed. – Patrick Parker Mar 01 '17 at 16:57
  • Just a note. Only in the physical world of _pointers_ can assertions exist at a position. Metaphysically, you can't say assertions exist at a position if this is ever true: `(?<=X)(?=Y)`. Since it can be true, assertions exist _between_ positions, never at a position. –  Mar 18 '17 at 05:01
  • Also, for awareness, an assertion _is_ by nature _atomic_ in that it can't be backtracked into from outside its bounds. It is by definition atomic. The difference is that, on its face, the atomic group consumes (by default, but this depends on its _internal_ constructs), whereas an assertion does not consume. By consume, I mean advance the current target position. It is always important to perceive _assertions_ as a construct that exists _between_ character positions, never at a position. –  Mar 18 '17 at 05:12
  • @sln - Your note is a bit confusing. If an assertion is by nature atomic, then how do you account for the two different results in my example code? Also, take a look at the JDK bug report. They did decide that the behavior was a bug after all. – Patrick Parker Mar 18 '17 at 07:40
  • First, what I mean by assertions being atomic can be seen here `"\r\n".matches( "(?=(\\r\\n|\\r|\\n))\\1(? –  Mar 18 '17 at 17:30
  • Second, let's look at the original question. Since your target is only `\r\n`, we can reduce the regex to its simplest form. We let _R_ in its negative lookbehind form be `(? –  Mar 18 '17 at 17:50
  • Clarification - what I mean by `it can't backtrack into the first assertion from an external frame` is that assertions exist as an independent _stack frame_. Each time run, they must be executed _left to right_, i.e. they can't bypass a section via any external condition. –  Mar 18 '17 at 18:05
  • @PatrickParker: The documentation was wrong, and the "fix" is wrong, see https://stackoverflow.com/a/47879236/371250 – ninjalj Dec 20 '17 at 10:33