How to enter non-BMP Unicode characters (more than 4 hexadecimal digits) as input to Mathematica

Question

Problem description: Mathematica uses "\:nnnn" as the syntax for Unicode input. For example, entering "\:6c34" gives "水" ("water" in Chinese). But what if one wants to enter "\:1f618" (face throwing a kiss)? When I tried this, I got "ὡ8" instead of the face. Mathematica evaluates "\:1f61" (which is "ὡ") before it sees the "8".

Question: How can we delay this evaluation, or, more generally, how can we enter a Unicode code point whose hexadecimal value has more than 4 digits?

Software and hardware platform: I am running Mathematica 8 on an Intel Mac. I tried both the command-line version of Mathematica and the notebook interface; they behave the same.

Thank you.

Reflections: Unicode is an extensible standard and it can grow (and it does grow:)). A software system may implement only a subset of the standard (8-bit, 16-bit or 32-bit encoding) and still be valid and useful. As the user of a software package, one should not assume that because the software says it supports Unicode, it supports all of Unicode.

Mathematica has virtually no support for Unicode code points that require more than 16 bits. See, for example, [Reading an UTF-8 encoded text file in Mathematica](http://stackoverflow.com/q/5597013/211232). — WReach, Nov 09 '11 at 01:21
Unicode 2.0 expanded the range of possible code points using the surrogate mechanism all of 15 years ago, and they still gave people almost 5 years to get with the program -- more than 10 years ago. There is simply no excuse for these crusty old program not to support full Unicode. 15 years is enough time for any upgrade. — tchrist, Nov 09 '11 at 01:30
It's kind of interesting that I can copy/paste a 32 bit unicode character into a Mathematica notebook and it looks correct, but does not survive saving and reloading the notebook. It gets translated into two 16 bit unicode characters. — Simon, Nov 09 '11 at 01:39

score 9 · Accepted Answer · edited May 23 '17 at 10:34

Short answer: You can't do this because Mathematica doesn't support these characters properly. See the end of the post for a limited workaround.

Just to clear up some things:

There's no need for a 32-bit encoding to handle more than ~65000 Unicode characters. The most common encodings used for Unicode, UTF-8 and UTF-16, are variable-length encodings, meaning that the number of bytes used to represent a character varies. UTF-16 uses either 2 or 4 bytes per character; the 4-byte form is a pair of 16-bit units (a surrogate pair). The Mathematica kernel interprets every 2-byte unit as a single character in a string, so a 4-byte sequence ends up as two invalid characters. This may be considered a bug. The front end is quite moody about how it handles 4-byte sequences, which is definitely a bug.
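
To make the 4-byte case concrete, here is a minimal sketch of how UTF-16 represents a code point above 16^^ffff as two 16-bit units. The helper name `toSurrogatePair` is hypothetical, not a built-in; it only relies on `BitShiftRight` and `BitAnd`.

```mathematica
(* Hypothetical helper, not a built-in: split a code point in the range
   16^^10000..16^^10ffff into its UTF-16 high and low surrogate units. *)
toSurrogatePair[cp_Integer] /; 16^^10000 <= cp <= 16^^10ffff :=
  With[{v = cp - 16^^10000},
    {16^^d800 + BitShiftRight[v, 10],   (* top 10 bits go into the high surrogate *)
     16^^dc00 + BitAnd[v, 16^^3ff]}]    (* low 10 bits go into the low surrogate *)

(* U+1F618, the character from the question *)
toSurrogatePair[16^^1f618]
(* {55357, 56856}, i.e. 16^^d83d and 16^^de18 *)
```

These two units are what the kernel ends up treating as two separate characters.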

Limited workaround

When working strictly in the kernel (e.g. reading the Unicode data from a file), I sometimes use this function as a workaround to get the actual Unicode code point of 2-unit (4-byte) UTF-16 sequences:

(* Combine a high/low surrogate pair of UTF-16 units into one Unicode code point *)
toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff := (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4
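
As a quick check of the arithmetic, the surrogate pair for the character in the question (U+1F618, which UTF-16 stores as the units 16^^d83d and 16^^de18) should map back to its code point. The definition is repeated so that the snippet runs on its own; the variable name `cp` is just for illustration.

```mathematica
toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff :=
  (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4

cp = toCodePoint[{16^^d83d, 16^^de18}]
(* 128536 *)

IntegerString[cp, 16]
(* "1f618" *)

(* A pair that is not a valid high/low surrogate pair fails the condition,
   so the expression is returned unevaluated. *)
toCodePoint[{16^^6c34, 16^^38}]
```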

You can use

Split[ToCharacterCode[str], If[16^^d800 <= # <= 16^^dbff, True] &]

to split a UTF-16 string into per-character groups of 16-bit units correctly (each group has length one or two, depending on the character).
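
Putting the two pieces together, the following sketch goes from 16-bit units to code points. It works on an explicit list of units rather than on `ToCharacterCode[str]`, so it does not depend on how a particular version stores strings. The single-unit rule for `toCodePoint` and the names `units` and `groups` are additions for illustration, and the `Split` test is written without the `If` wrapper, which should give the same grouping.

```mathematica
toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff :=
  (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4

(* Assumed extra rule: a lone unit is already a code point *)
toCodePoint[{a_}] := a

(* "水", then the surrogate pair for U+1F618, then "8" *)
units = {16^^6c34, 16^^d83d, 16^^de18, 16^^38};

(* Keep a unit together with its successor only when it is a high surrogate *)
groups = Split[units, 16^^d800 <= #1 <= 16^^dbff &]
(* {{27700}, {55357, 56856}, {56}} *)

toCodePoint /@ groups
(* {27700, 128536, 56} *)
```

Malformed input, such as a high surrogate that is not followed by a low surrogate, is not handled: such a group matches neither rule and `toCodePoint` stays unevaluated for it.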

This is an ugly and inconvenient workaround, and it won't let you display any of these characters in the front end unless you come up with some hack for that as well, e.g. importing the glyph reference images from unicode.org (at least for CJK they have them).

See also

See my earlier question on the same topic: [Reading an UTF-8 encoded text file in Mathematica](http://stackoverflow.com/q/5597013/211232)

If you are going to work with Chinese, you may come across this other problem too: Getting the Mathematica front end to obey the FontFamily option

score 5 · Answer 2 · Codie CodeMonkey · 2011-11-09T01:30:49.247

According to this page in the Mathematica 8 help:

Mathematica supports both 8- and 16-bit raw character encodings.

Presumably they are saying that they don't support 32-bit encodings as would be needed to support your desired character.

As further evidence (in the absence of a clear statement in the documentation), the list of supported encodings on the same page has no 32-bit encodings. 32-bit encodings are apparently only supported in MathLink. I suppose there hasn't been enough user demand.

edited Nov 09 '11 at 01:30

answered Nov 09 '11 at 01:18


To clear up some confusion: It is not necessary to "support a 32-bit encoding" to support Unicode properly. UTF-16 is a 16-bit encoding, but it can encode all Unicode characters, more than ~65000 of them, through the use of both one-word (2 byte) and two-word (4 byte) units. Mathematica simply fails to handle the two-word units correctly. From the comments on my earlier related question I have the impression that this is a *bug* as sometimes these are displayed correctly in Windows 7, but after saving and opening the notebook they get corrupted. http://stackoverflow.com/q/5597013/211232 – Szabolcs Nov 09 '11 at 08:11
