Why does JSON encode UTF-16 surrogate pairs instead of Unicode code points directly?

Question

To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".

ECMA-404: The JSON Data Interchange Format

I believe that there is no need to encode this character at all, so it could be represented directly as "". However, should one wish to encode it, it must, per spec, be encoded as "\uD834\uDD1E", not (as would seem reasonable) as "\u1d11e". Why is this?

For hysteric raisins. Sorry, historic reasons :-) And yes, you have the choice of directly using a Unicode code point or using an encoded surrogate pair. — gnasher729, Jul 19 '16 at 16:19
The `\u` escape format is limited to 4 hex digits, and thus can only encode 16bit values. 0x1D11E does not fit in 16 bits, thus the use of a surrogate pair. This goes back to the days of UCS-2 when all Unicode codepoints at the time fit nicely in 16 bits. When Unicode outgrew 16 bits, UTF-16 was invented to surpass the encoding limitation, but still needs to be backwards compatible with existing UCS-2 data and formats. — Remy Lebeau, Jul 19 '16 at 19:41
Supporting `"\u1d11e"` to mean `""` would make it difficult to encode the string `"ᴑe"` using `\u1d11` for the `ᴑ`. A better question would be why support for a sequence like `"\U0001d11e"` hasn't been added to the language (such as in Python and C++). — 一二三, Jul 20 '16 at 00:49
@一二三 ECMAScript 6 added a `\u{1d11e}` syntax (similar to Perl). — nwellnhof, Jul 24 '16 at 13:44
ECMAScript 6 — neat syntax, but JSON is already widely implemented, allowing a new syntax would require all upgrading all parsers, not worth the pain. — Beni Cherniavsky-Paskin, Sep 04 '17 at 14:56

score 6 · Accepted Answer · answered Jul 24 '16 at 13:38

One of the key architectural features of JSON is that JSON-encoded objects are valid Javascript literals that can be evaluated using the eval function, for example. Unfortunately, older Javascript implementations only support 16-bit Unicode escape sequences with four hex characters in string literals, so there's no other way than to use UTF-16 surrogates in escape sequences for code points above 0xFFFF in a portable way. (The \u{...} syntax that allows arbitrary code points was only introduced in ECMAScript 6.)

But as you mentioned, there's no need to use escape sequences if your application supports Unicode JSON text. Simply encode the characters directly in the respective Unicode format.

Why does JSON encode UTF-16 surrogate pairs instead of Unicode code points directly?

1 Answers1

Linked