
I know the web has mostly been standardizing on UTF-8 lately, and I was wondering whether there is any situation where using UTF-8 would be a bad idea. I've heard the argument that UTF-8, UTF-16, etc. may use more space, but in the end the difference has been negligible.
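To put a rough number on that overhead, here is a quick comparison in Python; the sample strings are arbitrary ones picked just for illustration:

```python
# Byte size of the same text in three encodings (sample strings are arbitrary).
samples = {
    "english": "The quick brown fox",
    "mixed":   "Résumé naïve café",
    "chinese": "统一码字符编码",
}

for name, text in samples.items():
    sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(name, sizes)
```

For mostly-ASCII text UTF-8 is the smallest of the three; for CJK-heavy text UTF-16 comes out ahead, which is the trade-off people usually cite.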

Also, what about Windows programs, the Linux shell, and things of that nature -- can you safely use UTF-8 there?

Joe Phillips
  • For existing protocols that don't support UTF-8, that's a good reason not to use UTF-8 :) I personally only like to support UTF-8 encoding, as it allows Unicode characters while letting my life revolve around the ASCII character space (opening up UTF-16 content in a "dumb" editor makes my eyes bleed; see the byte dump after these comments). –  Jan 15 '11 at 00:05
  • @pst: B e c a u s e i t l o o k s l i k e t h i s ? – dan04 Jan 15 '11 at 02:54
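A quick byte-level look at why ASCII-only UTF-16 reads that way in a byte-oriented editor; just a minimal Python sketch:

```python
# ASCII-only text in UTF-16LE: every character is followed by a 0x00 byte,
# which a byte-oriented ("dumb") editor shows as gaps between letters.
data = "like this?".encode("utf-16-le")
print(data)  # b'l\x00i\x00k\x00e\x00 \x00t\x00h\x00i\x00s\x00?\x00'
```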

3 Answers


When you need to write a program (performing heavy string manipulation) that has to be very, very fast, and you're sure you won't need exotic characters, maybe UTF-8 is not the best idea. In every other situation, UTF-8 should be the standard.

UTF-8 works well with almost all recent software, even on Windows.
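One practical caveat, as a minimal Python sketch: be explicit about the encoding instead of relying on the platform default, which on Windows has historically been an "ANSI" code page such as cp1252. The file name here is hypothetical.

```python
# Writing and reading a file with an explicit UTF-8 encoding, so the result
# does not depend on the platform's default ("ANSI") code page.
text = "naïve café ☃"

with open("notes.txt", "w", encoding="utf-8") as f:   # hypothetical file name
    f.write(text + "\n")

with open("notes.txt", "r", encoding="utf-8") as f:
    assert f.read().strip() == text
```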

Marc-François
  • Well, you *can* write UTF-8-based software on Windows (I've done it), but you have to avoid functions like `fopen` that take an "ANSI" string :-( – dan04 Jan 15 '11 at 00:48
  • What? fopen? In what language? Did I say it was impossible to write software on Windows that is UTF-8 based? I don't understand your point. Or maybe someone deleted his comment. – Marc-François Jan 15 '11 at 06:49

If UTF-32 is available, prefer that over the other versions for processing.

If your platform supports UTF-32/UCS-4 Unicode natively, then the "compressed" forms UTF-8 and UTF-16 may be slower, because they use a varying number of bytes per character (multi-byte sequences), which makes it impossible to do a direct lookup in a string by index, while UTF-32 uses a flat 32 bits for each code point, speeding up some string operations a lot.
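A minimal Python sketch of that indexing difference, operating on the raw encoded bytes (the sample string and index are arbitrary):

```python
# With a fixed-width encoding you can jump straight to code point i;
# with UTF-8 you have to walk the bytes from the start.
text = "naïve 猫"   # 7 code points; index 6 is '猫'
i = 6

# UTF-32: code point i sits at a fixed byte offset of 4 * i.
utf32 = text.encode("utf-32-le")
print(utf32[4 * i : 4 * (i + 1)].decode("utf-32-le"))   # 猫

# UTF-8: the byte offset of code point i is only found by a linear scan,
# skipping continuation bytes (those of the form 0b10xxxxxx).
utf8 = text.encode("utf-8")
pos = 0
for _ in range(i):
    pos += 1
    while pos < len(utf8) and (utf8[pos] & 0xC0) == 0x80:
        pos += 1
print(utf8[pos:].decode("utf-8")[0])                    # 猫
```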

Of course, if you are programming in a very restricted environment like, say, embedded systems, and you can be certain there will only ever be ASCII or ISO 8859-x characters around, then you can choose those charsets for efficiency and speed. But in general, stick with the Unicode Transformation Formats.

foo
  • UTF-32 takes 4x the space of ASCII (or of UTF-8 when encoding ASCII characters) for the same data. This can definitely matter. Plus, unlike the "legacy" charsets like ISO-8859-* (and unlike UTF-8), you have byte-order (endianness) issues with UTF-32 and UTF-16. – dkarp Jan 15 '11 at 02:45
  • ["UTF-32 (or UCS-4) is a protocol for encoding Unicode characters that uses exactly 32 bits for each Unicode code point. All other Unicode transformation formats use variable-length encodings. The UTF-32 form of a character is a direct representation of its codepoint."](http://en.wikipedia.org/wiki/UTF-32/UCS-4) – dkarp Jan 16 '11 at 16:03
  • @dkarp: that's why I wrote "for processing" in the first sentence. For storage, you may want to consider storage formats or compression, depending on the environment, the speed of the components, how frequently the strings are accessed, and other factors. Optimisation is rarely done on one factor alone. -- But the primary factor is, as I wrote, platform support. Windows, for example, used UTF-16 internally the last time I looked, so going with UTF-16 will be best there, leaving string-operation optimisation to the platform/library provider. – foo Jan 17 '11 at 11:07
  • @foo Sorry, but I don't buy it. If you don't want to do input in UTF-32 and you don't want to do output in UTF-32 and you don't want to store bloated UTF-32 strings in memory, what's the win? UTF-32 isn't even one character/grapheme per 32 bits, it's one *code point* per 32 bits (see the sketch after these comments). [Combining characters, canonical equivalence, joy.](http://unicode.org/faq/char_combmark.html) There's a reason that very few platforms and applications use UTF-32 -- the benefits generally do **not** outweigh the costs. – dkarp Jan 17 '11 at 13:57
  • @dkarp: You are correct about the difference between code points and characters; yet the trouble with varying run lengths holds true, including the cache/access-speed aspects. So there *are* points for and against. You could call UTF-16 "bloated" as well from a UTF-8/8-bit-charset perspective; yet many platform makers decided to go with it, probably seeing the best balance of tradeoffs there: Java does it by now, Windows does, Mac OS does, Qt does, and probably a number more use UTF-16. (Obviously accepting the necessity of byte-order handling.) – foo Jan 17 '11 at 19:59
  • @dkarp: But I've seen Python on Linux using UTF-32, and the "bloat" is reported to be "negligible"; see http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python . Several other *ix platforms prefer UTF-32 as well. So I come back to what I wrote before: use what your platform provides/prefers, as long as it is a Unicode representation. You Don't Want To Write Unicode Handling Yourself. – foo Jan 17 '11 at 19:59
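A minimal Python sketch of the code point versus grapheme distinction raised in these comments: even in UTF-32, an accented letter may take one unit or two depending on normalization.

```python
import unicodedata

# "é" can be one code point (NFC: U+00E9) or two (NFD: U+0065 U+0301);
# UTF-32 spends 4 bytes per code point either way.
nfc = unicodedata.normalize("NFC", "\u00e9")
nfd = unicodedata.normalize("NFD", "\u00e9")

for label, s in (("NFC", nfc), ("NFD", nfd)):
    print(label, len(s), "code point(s),",
          len(s.encode("utf-32-le")), "bytes in UTF-32,",
          len(s.encode("utf-8")), "bytes in UTF-8")
```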

It is well known that UTF-8 works best for file storage and network transport. But people debate whether UTF-16/UTF-32 are better for processing. One major argument is that UTF-16 is still variable-length, and even UTF-32 is still not one character per code point, so how are they better than UTF-8? My opinion is that UTF-16 is a very good compromise.

First, the characters outside the BMP, which need a surrogate pair (two code units) in UTF-16, are extremely rarely used. The Chinese characters (and some other Asian characters) in that range are basically dead ones; ordinary people won't use them at all, except for experts digitizing ancient books. So UTF-32 will be a waste most of the time. Don't worry too much about those characters: they won't make your software look bad if you don't handle them perfectly, as long as your software isn't aimed at those special users.
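For illustration, a minimal Python sketch with one common BMP character and one CJK Extension B character (chosen arbitrarily), showing where UTF-16 needs a surrogate pair:

```python
# U+4E2D ("中") is inside the BMP: one UTF-16 code unit.
# U+20000 (CJK Extension B) is outside the BMP: a surrogate pair (two units).
for ch in ("\u4e2d", "\U00020000"):
    utf16 = ch.encode("utf-16-le")
    print(f"U+{ord(ch):04X}: {len(utf16) // 2} UTF-16 code unit(s), "
          f"{len(ch.encode('utf-8'))} bytes in UTF-8")
```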

Second, we often need the string memory allocation to be related to character count, e.g. a database string column for 10 characters (assuming we store the Unicode string in normalized form), which will be 20 bytes for UTF-16. In most cases it works just like that; only in extreme cases (strings containing surrogate pairs) will it hold fewer than 10 characters. But for UTF-8, the common byte length of one character is 1-3 bytes for Western languages and 3-4 bytes for Asian languages, which means we need 10-40 bytes even for the common cases. More data, more processing.
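A minimal Python sketch of that column-sizing arithmetic, using two arbitrary 10-character sample strings:

```python
# Two 10-character samples: a 20-byte UTF-16 column fits both,
# while UTF-8 needs anywhere from 12 to 30 bytes here.
samples = {
    "western": "café crème",
    "chinese": "数据库字符串列测试一",
}

for name, text in samples.items():
    assert len(text) == 10
    print(f"{name}: {len(text.encode('utf-16-le'))} bytes in UTF-16, "
          f"{len(text.encode('utf-8'))} bytes in UTF-8")
```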

Dudu
  • I disagree with "Don't worry too much about those characters, as they won't make your software look bad if you didn't handle them properly". Saying "My program uses/supports UTF-16" when you mean "My program uses/supports a subset of UTF-16" is either disingenuous or an outright lie. Bugs are one thing; intentionally not supporting the whole of UTF-16 is not a bug. – Kevin Jul 26 '17 at 22:42