57

I've been doing a bit of reading around the subject of Unicode -- specifically, UTF-8 -- (non) support in C++11, and I was hoping the gurus on Stack Overflow could reassure me that my understanding is correct, or point out where I've misunderstood or missed something if that is the case.

A short summary

First, the good: you can define UTF-8, UTF-16 and UCS-4 literals in your source code. Also, the <locale> header contains several std::codecvt implementations which can convert between any of UTF-8, UTF-16, UCS-4 and the platform multibyte encoding (although the API seems, to put it mildly, less than straightforward). These codecvt implementations can be imbue()'d on streams to allow you to do conversion as you read or write a file (or other stream).
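
For example, a minimal sketch of those facilities (the locale name "en_US.UTF-8" is an assumption and is platform-dependent):

```cpp
#include <fstream>
#include <locale>
#include <string>

int main()
{
    // The three Unicode literal forms added in C++11:
    const char*     s8  = u8"κόσμε";  // UTF-8, stored as plain char
    const char16_t* s16 = u"κόσμε";   // UTF-16
    const char32_t* s32 = U"κόσμε";   // UTF-32 / UCS-4
    (void)s8; (void)s16; (void)s32;

    // Imbue a locale whose codecvt facet converts wchar_t to that locale's
    // multibyte encoding (UTF-8 here) as the stream is written.
    std::wofstream out("hello.txt");
    out.imbue(std::locale("en_US.UTF-8"));   // platform-dependent name
    out << L"κόσμε\n";
}
```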

[EDIT: Cubbi points out in the comments that I neglected to mention the <codecvt> header, which provides std::codecvt implementations which do not depend on a locale. Also, the std::wstring_convert and std::wbuffer_convert class templates can use these codecvts to convert strings and buffers directly, not relying on streams.]
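
A short sketch of that locale-independent route (illustrative only, not definitive):

```cpp
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // Convert a UTF-8 byte string to UTF-32 and back, no streams or
    // locale objects involved.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;

    std::u32string utf32 = conv.from_bytes(u8"κόσμε");
    std::string    utf8  = conv.to_bytes(utf32);
    (void)utf8;
}
```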

C++11 also includes the C11 <uchar.h> header (exposed as <cuchar>) which contains functions to convert individual characters from the platform multibyte encoding (which may or may not be UTF-8) to and from UCS-2 and UCS-4.
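
A sketch of that route; it assumes the current locale uses UTF-8 as its multibyte encoding, which is not guaranteed, and the locale name is platform-dependent:

```cpp
#include <cuchar>    // C++ header for the C11 <uchar.h> functions
#include <clocale>
#include <cstring>

int main()
{
    std::setlocale(LC_ALL, "en_US.UTF-8");   // platform-dependent name

    const char* mb = u8"é";                  // multibyte input (UTF-8 here)
    char32_t c32 = 0;
    std::mbstate_t state{};

    // Decode one multibyte character into a char32_t (UCS-4) code point.
    std::size_t consumed = std::mbrtoc32(&c32, mb, std::strlen(mb), &state);
    (void)consumed;                          // bytes eaten; c32 is now U+00E9
}
```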

However, that's about the extent of it. While you can of course store UTF-8 text in a std::string, there is no way that I can see to do anything really useful with it. For example, other than defining a literal in your code, you can't validate an array of bytes as containing valid UTF-8, you can't find out the length (i.e. number of Unicode characters, for some definition of "character") of a UTF-8-containing std::string, and you can't iterate over a std::string in any way other than byte-by-byte.

Similarly, even the C++11 addition of std::u16string doesn't really support UTF-16, but only the older UCS-2 -- it has no support for surrogate pairs, leaving you with just the BMP.
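
To make that concrete, here is a small sketch showing that a std::u16string will happily store the surrogate pair for a non-BMP character, but the library gives you nothing that treats the pair as one character:

```cpp
#include <string>
#include <cassert>

int main()
{
    // U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair.
    std::u16string s = u"\U0001F600";

    assert(s.size() == 2);   // two char16_t code units (0xD83D, 0xDE00)...
    // ...and every std::u16string operation (size, indexing, find, substr)
    // works on those code units, never on the code point they encode.
}
```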

Observations

Given that UTF-8 is the standard way of handling Unicode on pretty much every Unix-derived system (including Mac OS X* and Linux) and has largely become the de-facto standard on the web, the lack of support in modern C++ seems like a pretty severe omission. Even on Windows, the fact that the new std::u16string doesn't really support UTF-16 seems somewhat regrettable.

* As pointed out in the comments and made clear here, the BSD-derived parts of Mac OS use UTF-8 while Cocoa uses UTF-16.

Questions

If you managed to read all that, thanks! Just a couple of quick questions, as this is Stack Overflow after all...

  • Is the above analysis correct, or are there any other Unicode-supporting facilities I'm missing?

  • The standards committee has done a fantastic job in the last couple of years moving C++ forward at a rapid pace. They're all smart people and I assume they're well aware of the above shortcomings. Is there a particular well-known reason that Unicode support remains so poor in C++?

  • Going forward, does anybody know of any proposals to rectify the situation? A quick search on isocpp.org didn't seem to reveal anything.

EDIT: Thanks everybody for your responses. I have to confess that I find them slightly disheartening -- it looks like the status quo is unlikely to change in the near future. If there is a consensus among the cognoscenti, it seems to be that complete Unicode support is just too hard, and that any solution must reimplement most of ICU to be considered useful.

I personally don't agree with this; I think there is valuable middle ground to be found. For example, the validation and normalisation algorithms for UTF-8 and UTF-16 are well-specified by the Unicode consortium, and could be supplied by the standard library as free functions in, say, a std::unicode namespace. These alone would be a great help for C++ programs which need to interface with libraries expecting Unicode input. But based on the answer below (tinged, it must be said, with a hint of bitterness) it seems Puppy's proposal for just this sort of limited functionality was not well-received.
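
To illustrate the kind of middle ground meant here: UTF-8 validation needs no locale data at all and fits in a handful of lines. The namespace and function name below are hypothetical (nothing like this exists in the standard); the byte ranges follow Table 3-7 of the Unicode standard:

```cpp
#include <cstddef>

namespace unicode {   // hypothetical -- no such namespace exists in the standard

// Validate a byte sequence against the well-formed UTF-8 byte ranges
// (Unicode Table 3-7): no overlong forms, no surrogates, nothing above U+10FFFF.
bool is_valid_utf8(const unsigned char* p, std::size_t n)
{
    for (std::size_t i = 0; i < n; ) {
        unsigned char b = p[i];
        std::size_t len;
        unsigned char lo = 0x80, hi = 0xBF;   // allowed range for the second byte

        if      (b <= 0x7F) { ++i; continue; }                     // ASCII
        else if (0xC2 <= b && b <= 0xDF) len = 2;
        else if (b == 0xE0)              { len = 3; lo = 0xA0; }   // no overlongs
        else if (0xE1 <= b && b <= 0xEC) len = 3;
        else if (b == 0xED)              { len = 3; hi = 0x9F; }   // no surrogates
        else if (0xEE <= b && b <= 0xEF) len = 3;
        else if (b == 0xF0)              { len = 4; lo = 0x90; }   // no overlongs
        else if (0xF1 <= b && b <= 0xF3) len = 4;
        else if (b == 0xF4)              { len = 4; hi = 0x8F; }   // <= U+10FFFF
        else return false;                                         // 0xC0, 0xC1, 0xF5-0xFF

        if (i + len > n) return false;                             // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return false;
        for (std::size_t k = 2; k < len; ++k)
            if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
        i += len;
    }
    return true;
}

} // namespace unicode
```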

Tristan Brindle
  • 4
    Very sad ... but your assessment is absolutely correct. – FoggyDay Aug 11 '14 at 17:59
  • 7
    Related/duplicate: [How well is Unicode supported in C++11?](http://stackoverflow.com/q/17103925) – dyp Aug 11 '14 at 18:15
  • 4
    C++ supports the representation of various encoded strings. It does not contain a text processing library. This is a large, complex job best left to a dedicated, specialized library. – Kerrek SB Aug 11 '14 at 18:24
  • 8
    @KerrekSB With respect, I disagree. There was a recent proposal from Herb Sutter and others to add a 2-d vector graphics API to the standard library, which is surely far more out of scope than offering good support for internationalised text! By comparison, here is the Unicode API offered by GLib: https://developer.gnome.org/glib/2.30/glib-Unicode-Manipulation.html . Would it be such a stretch for ISO C++ to offer something similar for UTF-8 and UTF-16? – Tristan Brindle Aug 11 '14 at 18:38
  • 1
    @TristanBrindle That API is broken due to its use of fixed size gunichar, and I don't think it would represent an advance over what we have in C++ now. Codepoints are largely useless. A real character data type, if used at all, needs to support arbitrarily large extended grapheme clusters. APIs for text transformation should almost always work on whole strings due to oddities such as the single character 'ß' being the lower case version of the two characters 'SS'. If one cannot do `assert("SS" == toupper("ß"))` then the API is broken. – bames53 Aug 11 '14 at 18:57
  • 2
    @TristanBrindle, I spent some time with Unicode and even managed to write my own implementation that passed all 6.0 tests without errors :D Unicode is not a part of the standard. std::string is the same stupid container. If you need good Unicode support - use ICU (http://www.icu-project.org/). It is a reliable and sustainable solution that can use std::string. – Tanuki Aug 11 '14 at 19:02
  • 3
    @TristanBrindle: I see it like this: *Text* is an incredibly complicated, specialized topic area. It requires a vast database to live off, and tons of domain expertise. A complete library solution for operating with text would probably rival the entire existing standard library in size. That's why I think it should perhaps not be part of the standard itself. The language should provide facilities to *implement* such a library (which it now does), but that's enough. I'm perfectly happy to have a separate "Text TS", similar to the proposed drawing TS, of course. – Kerrek SB Aug 11 '14 at 20:30
  • 7
    `std::u16string` can hold UTF-16 surrogates. Just as `std::string` can hold multiple `char` codeunits representing a single Unicode codepoint, so can `std::u16string` hold multiple `char16_t` codeunits for a single codepoint (just like `std::wstring` can on Windows). Just because the STL doesn't have many provisions for *processing* UTF data does not mean STL containers cannot *hold* UTF data in its entirety. `std::u16string` is intended for holding UTF-16 data, surrogates and all. – Remy Lebeau Aug 12 '14 at 05:47
  • ASCII-ness is not present in the writing systems of many cultures; there might not be the concept of a character or accented character. Text adaptation including sizing then has to be done with something like a UI Rect structure. This was kept out of C++, which is really fine with me. – Jojje Aug 12 '14 at 11:48
  • You are forgetting the locale-independent Unicode conversion facilities of C++11, the `<codecvt>` header: http://en.cppreference.com/w/cpp/header/codecvt Also, you don't have to imbue, you can [wbuffer_convert](http://en.cppreference.com/w/cpp/locale/wbuffer_convert) or [wstring_convert](http://en.cppreference.com/w/cpp/locale/wstring_convert) – Cubbi Aug 13 '14 at 17:20
  • @bames53: You would like to be able to say `assert("SS" == toupper("ß"))`. Would you also like to be able to say `assert(tolower("SS") == "ß")`? Because that assertion would have to fail, right? I'm not saying you're wrong, I'm just curious how you would like the API to behave here. – TonyK Aug 13 '14 at 19:10
  • @TonyK I believe that not all instances of lowercase 'ss' in German are correctly written as 'ß'. If there are relatively consistent rules then they should be part of the locale, then using `tolower()` on a whole word should produce the appropriate lowercase spelling. In cases where the necessary context isn't available then there probably needs to be enough configurability for the program to indicate which it wants. – bames53 Aug 13 '14 at 19:53
  • 3
    @bames53: Well, exactly. There is no systematic rule that will tell you whether `tolower("SS")` should be `"ss"` or `"ß"` -- you would need a comprehensive dictionary for that, with proper nouns and everything. Would you say that an API that doesn't implement this feature is "broken"? – TonyK Aug 13 '14 at 20:43
  • @TonyK If the interface can't accommodate turning "SS" into "ß" then yes, it's broken. It's not necessary that it have the linguistic data to do so correctly built-in. – bames53 Aug 13 '14 at 21:14
  • @bames53: Yes, yes it really is. The UCD contains all the relevant data. There is no question of "How would you like the API to behave?". The Unicode Consortium already defined how case conversions work. – Puppy Aug 13 '14 at 22:03
  • @Puppy Sorry, I was unclear: By 'it's not necessary' I mean 'I won't count it as broken just because the data is not built in'. – bames53 Aug 13 '14 at 22:09
  • @bames53 I am German and I would consider a library not to fail on `assert("SS" == toupper("ß"))` broken. There is no upper case `'ß'` and therefore `toupper("ß")` should raise an exception or return a `NOT_A_CHARACTER`. Just because something does not meet your expectations does not mean it is broken. – nwp Aug 14 '14 at 12:47
  • 2
    @nwp I'm German, too, but that has nothing to do with the fact that `assert("SS" == toupper("ß"))` has to work. Of course, the fundamental reason behind that is that it is historically correct in German, even if many of us are not aware of that spelling rule – but in the end, the reason for the library is much simpler: Unicode defines conversion to uppercase characters that way. (See, e.g., http://www.unicode.org/faq/casemap_charprop.html#11 and a couple of other entries in that FAQ list.) – Christopher Creutzig Sep 05 '14 at 11:05
  • 1
    Having written acres of code for Mac OS X I can assure you that UTF-8 is *not* the standard way of handling Unicode on that platform. Most Mac apps make heavy use of the Foundation or Core Foundation libraries (even for handling URLs and file names) and those are built on NSString and CFStringRef, both of which utilize UTF-16. The only time I've ever used UTF-8 in Mac code was when preparing text for output to std::cerr. – Belden Fox Sep 06 '14 at 17:02
  • UTF-8 uses a clever strategy to store commonly used symbols with one byte and everything else with up to 4 bytes. This property makes it difficult to directly process UTF-8; therefore I think it is very good that these complex matters are not part of the C++ standard. If every character is represented as a 4-byte value then string operations can be realized in a more general way, e.g., they can assume that the next character is at index +1 in the string. So just convert to 4-byte Unicode as early as possible, process this representation, and convert back to UTF-8 as late as possible for output. – peschü Sep 11 '14 at 10:54
  • @TristanBrindle, personally I use http://utfcpp.sourceforge.net/ to support correct UTF-8, UTF-16, UTF-32 - conversion and main functions needed to work with UTF encoding. This library is headers only and lightweight. I think this library is a nice extension to default C++ functionality. Regarding standard C++ - I never found all functions needed for UTF/Unicode handling. – Arty Sep 21 '14 at 15:01
  • "far more out of scope than offering good support for internationalised text". Processing internationalised text is more complicated than simple 2D graphics, and goes beyond "mere" Unicode support. – n. 'pronouns' m. Sep 30 '14 at 17:47
  • Check out http://stackoverflow.com/questions/38688417/utf-conversion-functions-in-c11 – Brent Aug 05 '16 at 22:44
  • 1
    Possible duplicate of [How well is Unicode supported in C++11?](https://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11) – Raedwald Dec 12 '17 at 22:42

2 Answers

9

Is the above analysis correct

Let's see.

you can't validate an array of bytes as containing valid UTF-8

Incorrect. std::codecvt_utf8<char32_t>::length(state, start, end, max_length) returns the number of valid bytes in the array.
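
Not the answer's original demo, but a minimal sketch of that idea (note the facet's length() member also takes a conversion-state argument):

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>

// True if [data, data + n) is entirely valid, complete UTF-8.
bool all_valid_utf8(const char* data, std::size_t n)
{
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t state{};
    // length() stops at the first invalid or incomplete sequence, so
    // consuming every byte means the whole buffer validated.
    return cvt.length(state, data, data + n, n) == static_cast<int>(n);
}
```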

you can't find out the length

Partially correct. One can convert to char32_t and find out the length of the result. There is no easy way to find out the length without doing the actual conversion (but see below). I must say that the need to count characters (in any sense) arises rather infrequently.
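
A rough sketch of that conversion-based counting (not taken from the answer):

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Count code points by performing the actual conversion to UTF-32.
std::size_t codepoint_count(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(utf8).size();   // throws std::range_error on bad input
}
```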

you can't iterate over a std::string in any way other than byte-by-byte

Incorrect. std::codecvt_utf8<char32_t>::length(state, start, end, 1) gives you a way to iterate over UTF-8 "characters" (Unicode code points), and of course determine their number (that's not an "easy" way to count the number of characters, but it's a way).
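
For example, a sketch of code-point-wise iteration built on that call (not the answer's linked demo):

```cpp
#include <codecvt>
#include <cstddef>
#include <locale>
#include <string>

// Call visit(pointer, byte_count) once per UTF-8 encoded code point.
template <class Visitor>
void for_each_codepoint(const std::string& utf8, Visitor visit)
{
    std::codecvt_utf8<char32_t> cvt;
    std::mbstate_t state{};
    const char* p   = utf8.data();
    const char* end = p + utf8.size();

    while (p != end) {
        int len = cvt.length(state, p, end, 1);   // bytes in at most one code point
        if (len <= 0) break;                      // invalid or incomplete input
        visit(p, static_cast<std::size_t>(len));
        p += len;
    }
}
```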

doesn't really support UTF-16

Incorrect. One can convert to and from UTF-16 with e.g. std::codecvt_utf8_utf16<char16_t>. A result of conversion to UTF-16 is, well, UTF-16. It is not restricted to BMP.
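
Again, a minimal sketch (not the answer's demo) of such a conversion:

```cpp
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    // UTF-8 <-> UTF-16, surrogate pairs and all.
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;

    std::u16string utf16 = conv.from_bytes(u8"\U0001F600");  // non-BMP code point
    // utf16.size() == 2: a high and a low surrogate, i.e. genuine UTF-16.
    std::string utf8 = conv.to_bytes(utf16);                 // back to 4 UTF-8 bytes
    (void)utf8;
}
```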

Demo that illustrates these points.

If I have missed some other "you can't", please point it out and I will address it.

Important addendum. These facilities are deprecated in C++17. This probably means they will go away in some future version of C++. Use them at your own risk. All of the things enumerated in the original question are once again impossible to do (safely) using only the standard library.

n. 'pronouns' m.
  • This is great! I had no idea that `std::codecvt` could be coerced into doing these things, Google certainly didn't turn up any examples like this when I was looking. I still think that a true `std::unicode::string` (*a la* `QString`, `NSString` or Python3's string class) would be a worthwhile addition to standard C++, but it's nice to know there's more there already than I thought. Thanks. – Tristan Brindle Oct 01 '14 at 05:32
  • 1
    Technically, converting to a 32-bit representation doesn't get you the number of "characters" (i.e., graphemes) because of Unicode's support for combining marks. A single "perceived character" might be composed of three different codepoints, for example. http://unicode.org/faq/char_combmark.html – Jim Carnicelli Oct 01 '16 at 18:58
  • @JimCarnicelli The term "character" has a definition in Unicode, several in fact, all distinct from that of "grapheme". The two definitions relevant to C and C++ are (2) Synonym for abstract character and (3) The basic unit of encoding for the Unicode character encoding; whereas an abstract character is defined as a unit of information used for the organization, control, or representation of textual data. Don't see why we should abandon these definitions in favour of "grapheme", which is a distinct concept. – n. 'pronouns' m. Oct 02 '16 at 00:17
  • True, but to be sure, I'm not importing that terminology. The Unicode Standard Annex uses the term "grapheme clusters ('user-perceived characters')". It's techy but helpful. The standard also says that "An abstract character does not necessarily correspond to what a user thinks of as a 'character' and should not be confused with a grapheme". It's definitionally different. Counting characters, in the sense we would intuit if we were talking ASCII text, is different for Unicode. http://unicode.org/reports/tr29/tr29-9.html#Grapheme_Cluster_Boundaries – Jim Carnicelli Oct 03 '16 at 04:41
  • @JimCarnicelli Unless your program is doing typography, it need not be concerned with graphemes or grapheme clusters or user-perceived characters at all. They are just not relevant for most programmers. – n. 'pronouns' m. Oct 03 '16 at 06:54
  • I see your point. Most programmers are going to be concerned with storage and equivalence comparison of strings. Or outputting them to, say, a web browser to worry about displaying. Still, if a Unicode string is not normalized, even equivalence testing is suspect, in which case grapheme clusters are relevant. And there are some programmers like myself who need a count of perceived characters and, of course, to compare and sort "similar" text. Collation code is necessary for that. – Jim Carnicelli Oct 04 '16 at 15:29
  • Collation requires normalisation and locale-dependent knowledge, which is far beyond simple character counting. You need a collation library anyway. – n. 'pronouns' m. Oct 04 '16 at 15:41
-3

Is the above analysis correct, or are there any other Unicode-supporting facilities I'm missing?

You're also missing the utter failure of UTF-8 literals. They don't have a type distinct from narrow-character literals, which may have a totally unrelated (e.g. codepage) encoding. So not only did they not add any serious new facilities in C++11, they broke what little there was, because now you can't even assume that a char* is in the narrow-string encoding for your platform unless UTF-8 is the narrow-string encoding. So the new feature here is "We totally broke char-based strings on every platform where UTF-8 isn't the existing narrow string encoding".
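
A sketch of the problem being described (the function names below are made up purely for illustration):

```cpp
#include <cstdio>

// Both of these take const char*, so the type system cannot tell them apart.
void expects_platform_narrow(const char* s) { std::puts(s); }  // e.g. a codepage
void expects_utf8(const char* s)            { std::puts(s); }

int main()
{
    const char* narrow = "grüß";    // encoded in the compiler's narrow charset
    const char* utf8   = u8"grüß";  // encoded as UTF-8, but still just const char*

    expects_utf8(narrow);           // compiles silently, even if the encodings differ
    expects_platform_narrow(utf8);  // likewise -- the mismatch is invisible
}
```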

The standards committee has done a fantastic job in the last couple of years moving C++ forward at a rapid pace. They're all smart people and I assume they're well aware of the above shortcomings. Is there a particular well-known reason that Unicode support remains so poor in C++?

The Committee simply doesn't seem to give a shit about Unicode.

Also, many of the Unicode support algorithms are just that: algorithms. This means that to offer a decent interface, we need ranges. And we all know that the Committee can't figure out what they want w.r.t. ranges. The new Iterables thing from Eric Niebler may have a shot.

Going forward, does anybody know of any proposals to rectify the situation? A quick search on isocpp.org didn't seem to reveal anything.

There was N3572, which I authored. But when I went to Bristol and presented it, there were a number of problems.

Firstly, it turns out that the Committee doesn't bother to give feedback on non-Committee-member-authored proposals between meetings, resulting in months of lost work when you iterate on a design they don't want.

Secondly, it turns out that it's voted on by whoever happens to wander by at the time. This means that if your paper gets rescheduled, you have a relatively random bunch of people who may or may not know anything about the subject matter. Or indeed, anything at all.

Thirdly, for some reason they don't seem to view the current situation as a serious problem. You can get endless discussion about how exactly optional<T>'s comparison operations should be defined, but dealing with user input? Who cares about that?

Fourthly, each paper needs a champion, effectively, to present and maintain it. Given the previous issues, plus the fact that there's no way I could afford to travel to other meetings, it was certainly not going to be me, will not be me in the future unless you want to donate all my travel expenses and pay a salary on top, and nobody else seemed to care enough to put the effort in.

Puppy
  • Probably. I forget which number was which. – Puppy Aug 13 '14 at 17:17
  • 5
    Although I didn't downvote it, I feel like this answer is slightly subjective ^^ – Drax Aug 14 '14 at 08:40
  • 1
    @Drax: Considering that I was the guy who went and tried to fix the problem described by the OP, it seems like a pretty definitive answer to me. – Puppy Aug 14 '14 at 09:34
  • 8
    This is a rant, not an answer. Also I would appreciate it if you replace "X is broken" with "X has the problems Y and is therefore not useful in situations Z". – nwp Aug 14 '14 at 12:30
  • Codepages are not used in UTF-8. That is MBCS that you are thinking of, which came before Unicode. – Ben Key Sep 01 '14 at 17:29
  • 2
    Are there parts of your proposal that could fit into Boost? That seems like a proving ground for lots of stuff. – Mark Ransom Sep 30 '14 at 11:39