
Logically, "\n" is vertical whitespace (but logic is irrelevant whenever character encodings or locales are in play). According to

perl -e 'print "\n" =~ /\v/ ? "y\n" : "n\n";'

printing "y", it is. According to

Pattern.compile("\\v").matcher("\n").matches();

returning false in Java, it's not. This wouldn't confuse me at all if there weren't this posting claiming that

Sun’s updated Pattern class for JDK7 has a marvelous new flag, UNICODE_CHARACTER_CLASS, which makes everything work right again.

But I'm using java version "1.7.0_07", and the flag exists but seems to change nothing at all. Moreover, "\n" is no newcomer to Unicode but a plain old ASCII character, so I really don't see how this difference can arise. I'm probably doing something stupid, but I can't see it.

maaartinus
  • As best as I can tell, Unicode doesn't have a vertical whitespace property. It's purely a Perl construct that matches the following characters: U+000A, U+000B, U+000C, U+000D, U+0085, U+2028 and U+2029. Just use a character class matching those characters instead. – ikegami Sep 05 '12 at 22:53
  • @ikegami: Funny. I've just found [this list](http://unicode.org/Public/UNIDATA/PropList.txt) agreeing with you. – maaartinus Sep 06 '12 at 02:31
  • This question has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Escape Sequences". – aliteralmind Apr 10 '14 at 01:06
  • Note that since Java 8, `\v` means vertical whitespace – Haozhun Mar 25 '15 at 23:22

2 Answers


Java 7's Javadoc for java.util.regex.Pattern explicitly mentions \v in its "list of Perl constructs not supported by this class". So it's not that \n doesn't belong to Java's category of "vertical whitespace"; it's that Java 7 doesn't have a category of "vertical whitespace". Instead, Java 7 regexes have an undocumented feature whereby they interpret \v as referring to the vertical tab character, U+000B. (This is a traditional escape sequence from C/C++/Bash/etc., though Java string literals don't support it. Likewise with \a for alert/bell and \cX for control-character X.)

Edited to add: This has changed in newer versions of Java. According to Java 8's Javadoc for java.util.regex.Pattern, \v now means "A vertical whitespace character: [\n\x0B\f\r\x85\u2028\u2029]".
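Because the meaning of \v depends on the Java version, one way to see what it does on a given JVM is to probe it directly. The following is only a minimal sketch: the class name `VerticalEscapeProbe` and the helper `matchedCodePoints` are made up for illustration, and the scan is limited to the Basic Multilingual Plane. It prints whichever code points \v matches on the JVM that runs it, so the output differs between Java 7 and Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class VerticalEscapeProbe {

    /** Returns every non-surrogate BMP code point that the given regex matches as a whole. */
    static List<Integer> matchedCodePoints(String regex) {
        Pattern p = Pattern.compile(regex);
        List<Integer> hits = new ArrayList<>();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            char c = (char) cp;
            if (Character.isSurrogate(c)) {
                continue; // lone surrogates are not complete characters
            }
            if (p.matcher(String.valueOf(c)).matches()) {
                hits.add(cp);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // The result depends on the running JVM, so no expected output is given here.
        System.out.println("java.version = " + System.getProperty("java.version"));
        for (int cp : matchedCodePoints("\\v")) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Going by the Javadoc quoted above, a Java 8 or later JVM should list seven code points, while the Java 7 behaviour described in this answer would list only U+000B.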

ruakh
  • That's true and something I should have spotted myself. However, unlike many other undefined constructs, such as `Pattern.compile("\\C")`, it throws no `PatternSyntaxException`. In the source code I've finally found that it matches `U+000B`, i.e. "vertical tab" only. Sounds funny. – maaartinus Sep 05 '12 at 22:06
  • @maaartinus: `\v` is a traditional escape sequence for vertical tab (in the same group as `\n`, `\r`, and so on), and although Java doesn't support it in string literals (per section 3.10.6 of the JLS), there are a few similar non-Java escape sequences that `java.util.regex.Pattern` supports (`\a` for alert/bell, `\cX` for control-character `X`). The only funny business here, IMHO, is the mismatch between documentation and implementation: the Javadoc for `Pattern` lists all the escape sequences it's supposed to support, including `\n` and so on, and it doesn't mention `\v`. – ruakh Sep 05 '12 at 22:38
  • That's it. I think I'll add it to your answer, as this was the thing that confused me. – maaartinus Sep 05 '12 at 23:51
  • As mentioned in a comment to OP: since Java 8, `\v` and `\V` are supported: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html – IARI Jun 12 '20 at 10:24
  • @IARI: Thanks for the heads-up. I've now updated the answer to explain that. – ruakh Jun 12 '20 at 17:15

perldoc perlrecharclass says that \v matches a "vertical whitespace character". This is further explained:

"\v" matches any character considered vertical whitespace; this includes the platform's carriage return and line feed characters (newline) plus several other characters, all listed in the table below. "\V" matches any character not considered vertical whitespace. They use the platform's native character set, and do not consider any locale that may otherwise be in use.

Specifically, \v matches the following characters in Perl 5.16:

$ unichars -au '\v'           # From Unicode::Tussle
 ---- U+0000A LINE FEED
 ---- U+0000B LINE TABULATION
 ---- U+0000C FORM FEED
 ---- U+0000D CARRIAGE RETURN
 ---- U+00085 NEXT LINE
 ---- U+02028 LINE SEPARATOR
 ---- U+02029 PARAGRAPH SEPARATOR

You could use a character class to get the same effect as Perl's \v.

Of course this applies to Perl; I don't know whether it applies to Java.
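For Java, the character-class suggestion could be sketched as follows. The class name `VerticalWhitespace` and the method `isVertical` are hypothetical names chosen for this example; the class simply spells out the seven code points listed above, so it should not depend on how a particular Java version interprets \v.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class VerticalWhitespace {

    // Explicit class for U+000A, U+000B, U+000C, U+000D, U+0085, U+2028 and U+2029.
    static final Pattern VERTICAL =
            Pattern.compile("[\\n\\x0B\\f\\r\\x85\\u2028\\u2029]");

    /** Returns true if the code point is one of the seven characters listed above. */
    static boolean isVertical(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return VERTICAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isVertical('\n')); // prints "true"
        System.out.println(isVertical(' '));  // prints "false"
    }
}
```

Precompiling the pattern once in a constant avoids recompiling it on every call; the same bracket expression can also be embedded in a larger regex wherever Perl would use \v.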

ikegami
Keith Thompson