1377

The following code produces the output "Hello World!" (no really, try it).

public static void main(String... args) {

   // The comment below is not a typo.
   // \u000d System.out.println("Hello World!");
}

The reason for this is that the Java compiler parses the Unicode escape \u000d as a newline, so the code is transformed into:

public static void main(String... args) {

   // The comment below is not a typo.
   //
   System.out.println("Hello World!");
}

Thus the contents of a comment end up being "executed".

Since this can be used to "hide" malicious code or whatever an evil programmer can conceive, why is it allowed in comments?

Why is this allowed by the Java specification?

Peter Mortensen
Reg
  • 3
    Although strange, I don't see this as a real issue. Regular users wouldn't know the difference between code hidden in a comment and regular code so to them it doesn't matter. Then, it might be a team member hiding code from the other members, but developers will react when seeing a strange comment like this, and either delete it or investigate. If this were to get through and put into use, a VCS will tell you which person did it, so one would get caught. – Tobb Jun 09 '15 at 09:08
  • 198
One interesting thing is at least that **OP's IDE obviously gets it wrong** and displays incorrect highlighting. – dhke Jun 09 '15 at 09:09
  • 14
    Possibly related: http://stackoverflow.com/questions/4448180/why-does-java-permit-escaped-unicode-characters-in-the-source-code – dhke Jun 09 '15 at 09:10
  • 6
@Tobb: Yes, the authoritative answer can only come from the designers. However, there may be information somewhere on why this was done (compatibility, limitation of tools etc.), so it *is* answerable. – sleske Jun 09 '15 at 09:11
  • 2
'cause the newline character is also allowed ... I tested it with C++ and C#; those languages skip the rest of the line after reading //, but Java seems to parse the whole line and interpret the escape as a newline char. – Zelldon Jun 09 '15 at 09:12
  • 48
@Tobb But Java designers [are visiting SO](http://stackoverflow.com/a/23476994/1393766) so it is *possible* to get answers from one of them. Also, there may exist resources which already answer this question. – Pshemo Jun 09 '15 at 09:15
  • 2
I don't know for sure, but I suspect it's just a side effect of the general decision to process Unicode chars inside comments. Perhaps to allow code comments in foreign languages, or with mathematical Greek signs. Personally I'd avoid it... (javadoc might be an exception, but then I don't need this feature because HTML has its own support for special chars). – Pelit Mamani Jun 09 '15 at 09:23
  • 2
    http://stackoverflow.com/questions/3866187/why-cant-i-use-u000d-and-u000a-as-cr-and-lf-in-java funny example – Zelldon Jun 09 '15 at 09:49
  • 9
    Unicode escapes are allowed **anywhere** and are always parsed before everything else. The intent is that any source file can be converted to an equivalent file containing only ASCII characters. – user253751 Jun 09 '15 at 11:31
  • 1
    Related: http://stackoverflow.com/q/13116648/319403 – cHao Jun 09 '15 at 11:32
  • 2
    @dhke: This is also displayed as a comment in Eclipse, so do you know any IDE which does **not** display it as a comment? – Thomas Weller Jun 09 '15 at 12:46
  • 1
    @Thomas Netbeans (at least in 8.0.2) terminates the comment after the Unicode escaped newline, showing the `println()` as code. It also shows the same behaviour as the compiler for the escaped comment start code from https://stackoverflow.com/questions/4448180/why-does-java-permit-escaped-unicode-characters-in-the-source-code – dhke Jun 09 '15 at 13:38
  • 5
This also means that invalid Unicode escapes in comments cause compile errors (such as a Windows path containing `\users`), which can be annoying. – Pokechu22 Jun 09 '15 at 15:20
  • 9
    @dhke The OP didn't mention how his/her IDE displays that code. The only thing about highlighting that we can tell from the question text is that the Java code highlighter here in SO gets it wrong. – Matthias Jun 09 '15 at 18:55
  • 2
    What you are showing is a bug in the IDE. It's perfectly valid code. That the IDE does not SHOW it as code is the bug. IDEs need to stop assuming compilers are not cognizant of unicode. – CuriousRabbit Jun 09 '15 at 21:36
  • 4
    @CuriousRabbit, what makes you draw the conclusion that it's a bug in OP's IDE? (How do you even know OP is *using* an IDE?) – aioobe Jun 09 '15 at 21:41
  • 41
    The simple answer is that the code isn't in a comment at all, by the rules of the language, so the question is ill-formed. – user207421 Jun 09 '15 at 22:54
  • 1
    `\u000d` is the carriage return; `\u000a` would be the newline. Either of them finishes the `//` comment. – pts Jul 09 '15 at 21:30

7 Answers

752

Unicode decoding takes place before any other lexical translation. The key benefit of this is that it makes it trivial to go back and forth between ASCII and any other encoding. You don't even need to figure out where comments begin and end!

As stated in JLS Section 3.3 this allows any ASCII based tool to process the source files:

[...] The Java programming language specifies a standard way of transforming a program written in Unicode into ASCII that changes a program into a form that can be processed by ASCII-based tools. [...]

This gives a fundamental guarantee for platform independence (independence of supported character sets) which has always been a key goal for the Java platform.

Being able to write any Unicode character anywhere in the file is a neat feature, and especially important in comments, when documenting code in non-latin languages. The fact that it can interfere with the semantics in such subtle ways is just an (unfortunate) side-effect.
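The "decoding before tokenization" behavior is easy to see directly. The following sketch (class and variable names invented for illustration) compiles and runs: the escape in the declaration simply becomes the letter i before the lexer ever sees it, so the later reference to i resolves normally.

```java
class EscapesEverywhere {
    public static void main(String[] args) {
        // The escape below decodes to the letter 'i' before tokenization,
        // so this declares a variable literally named "i":
        int \u0069 = 42;
        System.out.println(i);
        // Escapes inside string literals are decoded the same way,
        // so this literal is simply "Hello":
        System.out.println("\u0048ello");
    }
}
```

Running it prints `42` and then `Hello`.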

There are many gotchas on this theme and Java Puzzlers by Joshua Bloch and Neal Gafter included the following variant:

Is this a legal Java program? If so, what does it print?

\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020\u0020
\u0063\u006c\u0061\u0073\u0073\u0020\u0055\u0067\u006c\u0079
\u007b\u0070\u0075\u0062\u006c\u0069\u0063\u0020\u0020\u0020
\u0020\u0020\u0020\u0020\u0073\u0074\u0061\u0074\u0069\u0063
\u0076\u006f\u0069\u0064\u0020\u006d\u0061\u0069\u006e\u0028
\u0053\u0074\u0072\u0069\u006e\u0067\u005b\u005d\u0020\u0020
\u0020\u0020\u0020\u0020\u0061\u0072\u0067\u0073\u0029\u007b
\u0053\u0079\u0073\u0074\u0065\u006d\u002e\u006f\u0075\u0074
\u002e\u0070\u0072\u0069\u006e\u0074\u006c\u006e\u0028\u0020
\u0022\u0048\u0065\u006c\u006c\u006f\u0020\u0077\u0022\u002b
\u0022\u006f\u0072\u006c\u0064\u0022\u0029\u003b\u007d\u007d

(This program turns out to be a plain "Hello World" program.)

In the solution to the puzzler, they point out the following:

More seriously, this puzzle serves to reinforce the lessons of the previous three: Unicode escapes are essential when you need to insert characters that can’t be represented in any other way into your program. Avoid them in all other cases.


Source: Java: Executing code in comments?!

aioobe
  • 84
    In short then, Java intentionally allows it: the "bug" is in the OP's IDE? – Bathsheba Jun 09 '15 at 09:15
  • 61
    @Bathsheba: It's more in the heads of people. People don't try to understand how Java parsing works, so IDEs sometimes display the code in a wrong way. In the example above, the comment should end with `\u000d` and the part after it should have code highlights. – Aaron Digulla Jun 09 '15 at 09:17
  • 62
    Another common mistake is to paste Windows paths in the code like `// C:\user\...` which leads to a compile error since `\user` isn't a valid Unicode escape sequence. – Aaron Digulla Jun 09 '15 at 09:18
  • 2
    I understand introduction of unicode characters, but not so much why it is allowed in comments? – Reg Jun 09 '15 at 09:19
  • 50
    In eclipse the Code after `\u000d` is highlighted partially. After pressing Ctrl+Shift+F the character is replaced with new line and rest of line is wrapped – bluelDe Jun 09 '15 at 09:21
So... this is related to how the compiler parses the source-code file? This problem can't be reproduced when we use *block comments* instead of single line comments – TheLostMind Jun 09 '15 at 09:22
  • 2
@Reg, there are many features of the language that don't make sense in conjunction with other features of the language. In this case, the language designers put the Unicode escape handling before the parser, and being able to use Unicode escapes in comments was simply a (possibly unfortunate) side effect. – aioobe Jun 09 '15 at 09:22
  • 5
    While I agree with the answer from @aioobe that the source code is valid and the problem is rather in the IDE (and the source code highlighter on StackOverflow), please note that there is another "problem" with the code. The CR character entered as unicode escape sequence is interpreted as a correct start of a new line, but the line number is not incremented. – Gregor Raýman Jun 09 '15 at 09:27
  • 7
    @UmaKanth, `//` comments are skipped all the way to the next new-line character. `\u000d` is interpreted as a new-line character. – aioobe Jun 09 '15 at 09:31
  • 20
    @TheLostMind If I understand the answer correctly you should be able to reproduce this with block comments as well. `\u002A/` should end the comment. – Taemyr Jun 09 '15 at 11:27
  • 11
    @Taemyr wow, `\u002A/` is really evil, eclipse utterly fails to parse it. Put code between `/*\u002A/` and `/\u002a*/` and it's completely hidden as comment. Found this as [bug 3533](https://bugs.eclipse.org/bugs/show_bug.cgi?id=3533) – Dorus Jun 09 '15 at 18:23
  • 8
    Note that this could have been avoided completely if the language specification had forbidden using `\u` notation to represent anything representable in ASCII. – R.. GitHub STOP HELPING ICE Jun 12 '15 at 22:47
  • 4
    @r good point, well made. At the very least this should be a Level 1 Compiler Warning. – Ben Jun 13 '15 at 18:23
  • 2
    @R..: That would make it so that if you're developing on a computer that doesn't use ASCII, then there are many characters you can't type in, such as IBM Mainframes that use EBCDIC, which doesn't have curly braces. – Mooing Duck Jun 13 '15 at 21:58
  • @TBohne: Do you actually have in mind such a character? – R.. GitHub STOP HELPING ICE Jun 13 '15 at 22:01
  • @R..: Yes, curly braces. `{}` – Mooing Duck Jun 14 '15 at 01:43
  • 1
    @TBohne: Wikipedia claims they're at positions C0 and D0 in EBCDIC. It seems pretty ridiculous to expect programmers to use `\u` escapes for something as ubiquitous as braces... – R.. GitHub STOP HELPING ICE Jun 14 '15 at 02:20
  • @R.: A quick glance shows you're right. But it also contains "Portability is hindered by a lack of many symbols commonly used in programming and in network communications, such as the curly braces." and "It exists in at least six mutually incompatible versions". I assume it must be a different version. – Mooing Duck Jun 14 '15 at 15:54
  • @R..: One wouldn't have to forbid everything in ASCII if one were to specify that the first pass of compilation is subdivision into lines, and any new-line characters that get introduced after that will be processed as-is, such that `string st="Hello\u000D\u000Athere"` would generate a twelve-character string containing a carriage return and a new-line. – supercat Sep 27 '15 at 19:47
  • 6
If ever anyone is skeptical and wants to test the hello world program, the class should be named "Ugly.java". There are other funny things that may be caused by this... For example, inserting an `LRM` character will allow you to compile code such as `for (char c‮ = 1; c‮ > 0; c‮++)` – Jean-François Savard Oct 07 '15 at 17:17
145

Since this hasn’t been addressed yet, here is an explanation of why the translation of Unicode escapes happens before any other source code processing:

The idea behind it was to allow lossless translation of Java source code between different character encodings. Today, there is widespread Unicode support and this doesn’t look like a problem, but back then it wasn’t easy for a developer from a western country to receive some source code from his Asian colleague containing Asian characters, make some changes (including compiling and testing it) and send the result back, all without damaging anything.

So, Java source code can be written in any encoding and allows a wide range of characters within identifiers, character and String literals and comments. Then, in order to transfer it losslessly, all characters not supported by the target encoding are replaced by their Unicode escapes.

This is a reversible process, and the interesting point is that the translation can be done by a tool which doesn’t need to know anything about Java source code syntax, since the translation rule doesn't depend on it. This works because the translation back to actual Unicode characters inside the compiler happens independently of the Java source code syntax as well. It implies that you can perform an arbitrary number of translation steps in both directions without ever changing the meaning of the source code.

This is the reason for another weird feature which hasn’t even been mentioned: the \uuuuuuxxxx syntax:

When a translation tool is escaping characters and encounters a sequence that is already an escape sequence, it should insert an additional u into the sequence, converting \ucafe to \uucafe. The meaning doesn’t change, but when converting in the other direction, the tool should just remove one u, and replace only sequences containing a single u with their Unicode characters. That way, even Unicode escapes are retained in their original form when converting back and forth. I guess no one ever used that feature…
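As a sketch of the tool behavior described above (all names invented; the even-number-of-backslashes subtlety of JLS section 3.3 is ignored to keep it short): non-ASCII characters become backslash-u escapes, any escape already present gains one extra u, and both steps are undone on the way back.

```java
class AsciiRoundTrip {
    // Escape non-ASCII characters; add one extra 'u' to existing escapes
    // so the transformation stays reversible.
    static String toAscii(String src) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\' && i + 1 < src.length() && src.charAt(i + 1) == 'u') {
                out.append("\\uu");   // existing escape: insert one extra 'u'
                i++;
            } else if (c > 127) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Reverse: decode single-u escapes, strip one 'u' from multi-u escapes.
    // Assumes well-formed input (four hex digits after the u's).
    static String fromAscii(String src) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\' && i + 1 < src.length() && src.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < src.length() && src.charAt(j) == 'u') j++;
                int us = j - i - 1;
                if (us > 1) {          // multi-u: drop one 'u', keep the escape
                    out.append('\\');
                    for (int k = 1; k < us; k++) out.append('u');
                    i = j - 1;
                } else {               // single-u: decode to the character
                    out.append((char) Integer.parseInt(src.substring(j, j + 4), 16));
                    i = j + 3;
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String original = "caf\u00e9 \\u0041"; // an accented e plus a literal backslash-u0041
        String ascii = toAscii(original);
        System.out.println(ascii);
        System.out.println(fromAscii(ascii).equals(original));
    }
}
```

The accented character becomes a single-u escape, the pre-existing escape becomes a double-u escape, and the round trip reproduces the original exactly (the second line printed is `true`).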

Peter Mortensen
Holger
  • 2
Interestingly, `native2ascii` doesn't seem to use the `\uu...xxxx` syntax. – ninjalj Jun 09 '15 at 18:17
  • 5
    Yeah, `native2ascii` was intended to help preparing resource bundles by converting them to iso-latin-1 as [`Properties.load`](http://docs.oracle.com/javase/8/docs/api/java/util/Properties.html#load-java.io.InputStream-) was fixed to read latin-1 only. And there, the rules are different, no `\uuu…` syntax and no early processing stage. In property files, `property=multi\u000aline` is indeed the same as `property=multi\nline`. (Contradicting to the phrase “using Unicode escapes as defined in section 3.3 of The Java™ Language Specification” of the documentation) – Holger Jun 09 '15 at 18:52
  • 10
    Note that this design goal could have been achieved without any of the warts; the easiest way would have been to forbid `\u` escapes to generate characters in the U+0000–007F range. (All such characters can be represented natively by all the national encodings that were relevant in the 1990s—well, maybe except some of the control characters, but you don't need those to write Java anyway.) – zwol Jun 09 '15 at 19:28
  • 3
    @zwol: well, if you exclude control characters which aren’t allowed within Java source code anyway, you are right. Nevertheless, it would imply making rules more complicated. And today, it’s too late to discuss the decision… – Holger Jun 09 '15 at 19:34
  • ah the problem of saving a document in utf8 and not latin or something else. All my databases were broken as well because of this western nonsense – David 天宇 Wong Jun 17 '15 at 21:21
108

I'm going to completely ineffectually add the point, just because I can't help myself and I haven't seen it made yet, that the question is invalid because it contains a hidden premise which is wrong: that the code is in a comment!

In Java source code, \u000d is equivalent in every way to an ASCII CR character. It is a line ending, plain and simple, wherever it occurs. The formatting in the question is misleading; what that sequence of characters actually corresponds to syntactically is:

public static void main(String... args) {
   // The comment below is not a typo.
   //
   System.out.println("Hello World!");
}

IMHO the most correct answer is therefore: the code executes because it isn't in a comment; it's on the next line. "Executing code in comments" is not allowed in Java, just like you would expect.

Much of the confusion stems from the fact that syntax highlighters and IDEs aren't sophisticated enough to take this situation into account. They either don't process the unicode escapes at all, or they do it after parsing the code instead of before, like javac does.

Pepijn Schmitz
  • 6
    I agree, this isn't a java "design error" , but it's an IDE bug. – bvdb Jun 22 '17 at 12:59
  • 3
    The question is rather about why code that _looks_ like a comment to someone not familiar with this particular aspect of the language and perhaps without reference to syntax highlighting, is in fact _not_ a comment. Objecting on the basis of the premise of the question being invalid is disingenuous. – Phil Jun 15 '18 at 05:37
  • 1
    @Phil: it only looks like a comment when viewed with particular tools, others show it otherwise. – jmoreno Feb 06 '19 at 12:00
  • 1
@jmoreno one should not _have_ to have anything more than a text editor to read code. At the very least, it violates the principle of least surprise, namely that // style comments continue until the next \n character - not to any other sequence which is ultimately replaced by \n eventually. Comments are never expected to be anything other than stripped. Bad preprocessor. – Phil Feb 07 '19 at 08:15
  • So in order to ask a question we have to already know the answer in order to confirm that our question was, in fact, valid? I don't think the question is "invalid" - though it may contain an incorrect assumption. – StayOnTarget Jan 08 '21 at 20:38
69

The \u000d escape terminates a comment because \u escapes are uniformly converted to the corresponding Unicode characters before the program is tokenized. You could equally use \u002F\u002F (each \u002F decodes to a '/') instead of // to begin a comment.
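As a concrete (hypothetical) sketch, the class below compiles because the two escapes at the start of the line decode to the two slashes of a line comment before tokenization:

```java
class EscapedCommentDemo {
    public static void main(String[] args) {
        \u002F\u002F this entire line is a comment once the escapes decode
        System.out.println("still runs");
    }
}
```

Running it prints `still runs`; the escaped line is treated exactly like an ordinary `//` comment.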

This is a bug in your IDE, which should syntax-highlight the line to make it clear that the \u000d ends the comment.

This is also a design error in the language. It can't be corrected now, because that would break programs that depend on it. \u escapes should either be converted to the corresponding Unicode character by the compiler only in contexts where that "makes sense" (string literals and identifiers, and probably nowhere else) or they should have been forbidden to generate characters in the U+0000–007F range, or both. Either of those semantics would have prevented the comment from being terminated by the \u000d escape, without interfering with the cases where \u escapes are useful—note that that includes use of \u escapes inside comments as a way to encode comments in a non-Latin script, because the text editor could take a broader view of where \u escapes are significant than the compiler does. (I am not aware of any editor or IDE that will display \u escapes as the corresponding characters in any context, though.)

There is a similar design error in the C family,1 where backslash-newline is processed before comment boundaries are determined, so e.g.

// this is a comment \
   this is still in the comment!

I bring this up to illustrate that it happens to be easy to make this particular design error, and not realize that it's an error until it is too late to correct it, if you are used to thinking about tokenization and parsing the way compiler programmers think about tokenization and parsing. Basically, if you have already defined your formal grammar and then someone comes up with a syntactic special case — trigraphs, backslash-newline, encoding arbitrary Unicode characters in source files limited to ASCII, whatever — that needs to be wedged in, it's easier to add a transformation pass before the tokenizer than it is to redefine the tokenizer to pay attention to where it makes sense to use that special case.

1 For pedants: I am aware that this aspect of C was 100% intentional, with the rationale — I am not making this up — that it would allow you to mechanically force-fit code with arbitrarily long lines onto punched cards. It was still an incorrect design decision.

zwol
  • 17
    I wouldn't go as far as saying that it's a design *error*. I could agree with you that it was a poor design choice, or a choice with unfortunate consequences, but I still think that it works as the language designers intended: It enables you to use any unicode character anywhere in the file, while maintaining ASCII encoding of the file. – aioobe Jun 09 '15 at 15:29
  • I would think that if the rationale was as stated, then backslash followed by some specific other character (e.g. `!`) should have indicated that the remainder of the physical line should be ignored, and the first character of the next line should be regarded as directly following the character before the backslash. That would allow `\!` to be punched in columns 71-72, leaving the eight columns available for sequence numbers. In some contexts the marker-stripe trick could lessen the need for machine-readable numbers, but I wouldn't think it would eliminate it. – supercat Jun 09 '15 at 16:08
  • 12
    That having been said, I think the choice of processing stage for `\u` was less absurd than the decision to follow C's lead in using leading zeroes for octal notation. While octal notation is sometimes useful, I've yet to hear anyone articulate an argument why a leading zero is a good way of indicating it. – supercat Jun 09 '15 at 16:09
  • 3
    @supercat The people who threw that feature into C89 were generalizing the behavior of the original K&R preprocessor rather than designing a feature from scratch. I doubt they were familiar with punched card best practices, and I also doubt that the feature has *ever* been used for its stated purpose, except maybe for one or two retrocomputing exercises. – zwol Jun 09 '15 at 18:33
  • 8
    @supercat I wouldn't have a problem with Java `\u` as pre-tokenization transformation if it were forbidden to produce characters in the U+0000..U+007F range. It's the combination of "this works everywhere" and "this aliases ASCII characters with syntactic significance" that demotes it from awkward to flat-out wrong. – zwol Jun 09 '15 at 18:34
  • @zwol: I could go along with that, though in general I'm not a big fan of how languages approach non-ASCII identifiers. Since Unicode includes a lot of homoglyphs, and languages that allow Unicode identifiers often impose minimal restrictions on their use, it's excessively difficult to produce a program listing which is human readable but semantically unambiguous. – supercat Jun 09 '15 at 18:42
  • @supercat Yeah, even Unicode's own recommendations for how to do identifiers in programming languages are too loosey-goosey for me to be comfortable with. – zwol Jun 09 '15 at 19:17
  • @zwol: Personally, I think programming languages should define tight and loose matching criteria, and require that identifiers must match tightly to be considered a match, but should shadow all identifiers that match loosely (such a rule should apply to upper/lower case in ASCII, but also in many Unicode scenarios). Thus, if `Foo` is defined in an outer context and `foo` is defined in an inner one, then within the inner context `foo` would refer to the latter identifier and `Foo` would be a syntax error. Applying such a rule to homoglyphs, but with a means to override it in special cases... – supercat Jun 09 '15 at 19:25
  • ...(e.g. explicitly tell the compiler "I want identifiers `foo` and `Foo`, or `Χ` and `X`, to both be accessible here) would help guard against a lot of ambiguous situations. – supercat Jun 09 '15 at 19:33
  • 3
@supercat: today, IDEs do that. The “loose matching criteria” often consists of a single letter, then the IDE fills in the remaining characters to make it an appropriate “tight matching criteria” and I don’t think that compilers should ever deal with the “loose matching criteria”. I.e., I don’t want a compiler that happily resolves occurrence of `i` to `I` and when somebody compiles it on a Turkish locale, `i` is suddenly resolved to `İ`… – Holger Jun 09 '15 at 19:44
  • @Holger: Under the rules I'd like to see, within a scope where `Six` was defined, identifiers `six`, `SİX`, `Sıx`, etc. would not be usable, even if they existed in outer scopes. Collisions may result in syntax errors that require an explicit "distinguish these identifiers" directive, but could not change the meaning of code that still compiled. – supercat Jun 09 '15 at 19:52
  • 4
    On your "for pedants": Of course at that time [the `//` single-line comment didn't exist](http://stackoverflow.com/q/8284940/256431). And since C has a statement terminator that is not a new line, it would mostly be used for long strings, except that as far as I can determine "string literal concatenation" _was_ there from K&R. – Mark Hurd Jun 16 '15 at 17:39
22

This was an intentional design choice that goes all the way back to the original design of Java.

To those folks who ask "who wants Unicode escapes in comments?", I presume they are folks whose native language uses the Latin character set. In other words, it is inherent in the original design of Java that folks could use arbitrary Unicode characters wherever legal in a Java program, most typically in comments and strings.

It is arguably a shortcoming in programs (like IDEs) used to view the source text that such programs cannot interpret the Unicode escapes and display the corresponding glyph.

galath
21

I agree with @zwol that this is a design mistake; but I'm even more critical of it.

The \u escape is useful in string and char literals, and that's the only place it should exist. It should be handled the same way as other escapes like \n; "\u000A" should mean exactly "\n".

There is absolutely no point in having \uxxxx in comments - nobody can read that.

Similarly, there's no point in using \uxxxx in any other part of the program. The only exception is probably in public APIs that are coerced to contain some non-ASCII chars - when was the last time we saw that?

The designers had their reasons in 1995, but 20 years later, this appears to be a wrong choice.

(question to readers - why does this question keep getting new votes? is this question linked from somewhere popular?)

ZhongYu
  • 5
I guess you are not hanging around where non-ASCII characters are used in APIs. There are people using them (not me), e.g. in Asian countries. And when you are using non-ASCII characters in identifiers, forbidding them in documentation comments makes little sense. Nevertheless, allowing them inside a token and allowing them to change the meaning or boundary of a token are different things. – Holger Jun 09 '15 at 17:25
  • 15
    they can use proper file encoding. why write `int \u5431` when you can do `int 整` – ZhongYu Jun 09 '15 at 17:29
  • 3
    What will you do when *you* have to compile code against their API and cannot use the proper encoding (assume that there wasn’t widespread `UTF-8` support in 1995). You just have to call one method and don’t want to install the Asian language support pack of your operating system (remember, the nineties) for that single method… – Holger Jun 09 '15 at 17:34
  • 1
    Is that an imaginary scenario? I don't think it happens in real world. – ZhongYu Jun 09 '15 at 17:37
  • 1
    It would be even worse if arbitrary characters were allowed in identifiers but at the same time, accessing these identifiers from certain locales were impossible. When you design a language, you should decide. I can live with a language restricting symbols to ASCII as I see the problems with localized source code. But I’m also an active user of the all-English stackoverflow site, so I (and probably you as well) have a bias. We know the worth of being able to talk with others (on an international site) about the code. By the way, I left an answer explaining the original intent (afair)… – Holger Jun 09 '15 at 18:09
  • 5
    What is much clearer now than 1995 is that you better know English if you want to program. Programming is an international interaction, and almost all resources are in English. – ZhongYu Jun 09 '15 at 18:16
  • @Holger: non-ASCII in identifiers is another can of worms, since it is not only non-ASCII alphanumeric, but it includes too much, including control codes: http://stackoverflow.com/questions/4838507/why-does-java-allow-control-characters-in-its-identifiers – ninjalj Jun 09 '15 at 18:20
  • 8
    I don’t think that this has changed. Java’s documentation was all-English most of the time as well. There was a Japanese translation maintained for a while but maintaining *two* languages doesn’t really back up the idea of maintaining it for all the locales of the world (it rather disproved it). And before that, there was no mainstream language with Unicode support in identifiers anyway. So I would guess, somebody *thought* that localized source code was the next big thing. I would say *thankfully*, it didn’t take off. – Holger Jun 09 '15 at 18:24
  • 1
    @ninjalj: yeah, I like what you can do with embedded right-to-left writing but also things as simple as the fact that `ä` and `ä` are different identifiers (because one is `U+0061U+0308` and the other `U+00E4`). – Holger Jun 09 '15 at 18:29
  • @Holger: RTL itself can also be confusing. There was a question which I cannot find right now where the OP was trying to match a substring on a string: the arguments were reversed. – ninjalj Jun 09 '15 at 18:36
  • @StephenP - you are probably thinking `%n` in `format()`. `\n` means exactly the character `0x0a`, see http://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.10.6 – ZhongYu Jun 10 '15 at 17:48
  • @bayou.io I feel that unicode could be valid in a comment... more specifically a documenting comment /** ... */ with a description that will be generated into an HTML javadoc page; now in that case I'd probably still use a literal newline over this, and for a documenting comment, it wouldn't suffer this issue unless I had the unicode characters for BOTH * and / in the comment directly after one another because documenting comments are not terminated by a single new line character. – anonymous Jun 10 '15 at 21:18
  • @anonymous - good point. however we can use xml escaping there - `⪹` -> ⪹ – ZhongYu Jun 10 '15 at 22:41
  • 1
    @bayou.io What am I supposed to do when I have to model something that doesn't have an English name? This is pretty common if you ever deal with domains like law or business or similar that lack these things. Especially in legal domains, words have very specific meanings. Imagine if the standard alphabet didn't have C, X or Q. Now you have a class called "KommonLaw" or something. You'd want to use the 'C'. In your world, that's wrong. But what if KommonLaw meant something else. Now what? You'd at some point try to use a language that let you use 'C' instead, probably. – Haakon Løtveit Feb 28 '16 at 08:24
  • @HaakonLøtveit - can't you use the character directly, instead of the escape sequence, e.g. `class Løtveit` instead of `class L\u00D8tveit` – ZhongYu Feb 28 '16 at 19:26
  • That would work great for me, but then you have to write "new Løe()" somewhere, and you'd probably get really tired of copypasting 'ø's real soon. Letting you sub \u00D8 those places would likely be easier on your sanity. (Or you'd just use the IBM international layout, but that's because it supports most western European characters. But then there's Pinyin etc.) – Haakon Løtveit Feb 28 '16 at 21:11
  • @HaakonLøtveit -- I didn't have type or copy `@HaakonLøtveit`, the editor does it for me by auto completion. Same for Java. Even if I have to copy `ø`, it's probably easier than finding and typing its unicode. – ZhongYu Feb 29 '16 at 00:32
  • 1
    Yes. Your editor today, in 2016 does that. But Java was released in 1995. Emacs didn't have semantic autocompletion back then, and was the most advanced thing that was available for Java. It didn't even have unicode support. – Haakon Løtveit Feb 29 '16 at 09:02
11

The only people who can answer why Unicode escapes were implemented as they were are the people who wrote the specification.

A plausible reason for this is that there was the desire to allow the entire BMP as possible characters of Java source code. This presents a problem though:

  • You want to be able to use any BMP character.
You want to be able to input any BMP character reasonably easily. A way to do this is with Unicode escapes.
  • You want to keep the lexical specification easy for humans to read and write, and reasonably easy to implement as well.

This is incredibly difficult when Unicode escapes enter the fray: it creates a whole load of new lexer rules.

The easy way out is to do lexing in two steps: first search for all Unicode escapes and replace each with the character it represents, then parse the resulting document as if Unicode escapes didn't exist.

The upside to this is that it's easy to specify, so it makes the specification simpler, and it's easy to implement.

The downside is, well, your example.
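The search-and-replace step can be sketched as a standalone pass (hypothetical code, names invented; it also implements the JLS section 3.3 rule that a backslash starts an escape only when preceded by an even number of backslashes):

```java
class UnicodePrescan {
    // Sketch of the first translation step: replace each eligible escape
    // (backslash, one or more u's, four hex digits) with the character it
    // denotes, before any tokenization happens. Assumes well-formed input.
    static String prescan(String src) {
        StringBuilder out = new StringBuilder();
        int precedingBackslashes = 0;
        for (int i = 0; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\' && precedingBackslashes % 2 == 0
                    && i + 1 < src.length() && src.charAt(i + 1) == 'u') {
                int j = i + 1;
                while (j < src.length() && src.charAt(j) == 'u') j++;
                out.append((char) Integer.parseInt(src.substring(j, j + 4), 16));
                i = j + 3;
                precedingBackslashes = 0;
            } else {
                precedingBackslashes = (c == '\\') ? precedingBackslashes + 1 : 0;
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes below are ordinary string escapes; at run
        // time each pair is a single backslash character.
        System.out.println(prescan("int \\u0069 = 1;"));
        // An odd number of preceding backslashes suppresses the escape:
        System.out.println(prescan("not an escape: \\\\u0069"));
    }
}
```

The first call rewrites the escape to the letter i, printing `int i = 1;`; the second leaves its input untouched because the backslash before the u is itself escaped.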

Peter Mortensen
Martijn
  • 2
    Or, restrict the use of \uxxxx to identifiers, string literals, and character constants. Which is what C11 does. – ninjalj Jun 13 '15 at 12:33
That really complicates the parser rules, though, because those rules are what define those things, which I'm speculating is part of the reason it is the way it is. – Martijn Jun 13 '15 at 16:11