66

From here:

Essentially, string uses the UTF-16 character encoding form

But when saving with StreamWriter:

This constructor creates a StreamWriter with UTF-8 encoding without a Byte-Order Mark (BOM),
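
To make the two defaults concrete, here is a minimal sketch (the file names are made up for illustration): the string lives in memory as UTF-16 code units, while a `StreamWriter` constructed with just a path writes UTF-8 without a BOM unless you pass another encoding explicitly.

```csharp
using System;
using System.IO;
using System.Text;

class EncodingDefaultsDemo
{
    static void Main()
    {
        string text = "hello world";   // held in memory as UTF-16 code units

        // Default StreamWriter: UTF-8, no byte-order mark.
        using (var writer = new StreamWriter("utf8.txt"))
            writer.Write(text);

        // Explicit UTF-16 (little-endian); StreamWriter writes the encoding's BOM preamble.
        using (var writer = new StreamWriter("utf16.txt", false, Encoding.Unicode))
            writer.Write(text);

        Console.WriteLine(new FileInfo("utf8.txt").Length);   // 11 bytes (ASCII-only text)
        Console.WriteLine(new FileInfo("utf16.txt").Length);  // 24 bytes (2-byte BOM + 22 bytes of text)
    }
}
```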

I've seen this sample (broken link removed):

*(image: a table comparing the encoded byte sizes of several sample strings in UTF-8 and UTF-16)*

It looks like UTF-8 is smaller for some strings, while UTF-16 is smaller for others.
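
That comparison is easy to reproduce; here is a small sketch, with sample strings chosen purely for illustration (ASCII text is smaller in UTF-8, while text from scripts above U+07FF, such as Japanese kana, is smaller in UTF-16):

```csharp
using System;
using System.Text;

class Utf8VsUtf16Sizes
{
    static void Main()
    {
        Compare("Hello world");   // ASCII: 1 byte per character in UTF-8, 2 in UTF-16
        Compare("こんにちは");     // Japanese kana: 3 bytes per character in UTF-8, 2 in UTF-16
    }

    static void Compare(string s)
    {
        Console.WriteLine($"\"{s}\": UTF-8 = {Encoding.UTF8.GetByteCount(s)} bytes, " +
                          $"UTF-16 = {Encoding.Unicode.GetByteCount(s)} bytes");
    }
}
```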

  • So why does .NET use UTF-16 as the default encoding for string, and UTF-8 for saving files?

Thank you.

P.S. I've already read the famous article.

David Klempfner
  • 6,679
  • 15
  • 47
  • 102
Royi Namir
  • 131,490
  • 121
  • 408
  • 714
  • 1
    [This post](http://blog.coverity.com/2014/04/09/why-utf-16/#.U1pXbvldWCk) from Eric Lippert goes into more details of why the decision was made. – Lukazoid Apr 25 '14 at 12:40
  • @Lukazoid Great post but note the comments, where Hans Passant disagrees with a convincing argument. – Ohad Schneider Jun 21 '14 at 21:52
  • 2
    Working version of @Lukazoid's link: https://web.archive.org/web/20161121052650/http://blog.coverity.com/2014/04/09/why-utf-16/ – Ian Kemp Nov 07 '18 at 06:14
  • The short answer is that UTF16 is not portable, while UTF8 is super portable. – Zoltan Tirinda Mar 26 '19 at 13:26

3 Answers

59

If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.
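
As a rough illustration of what "code unit" means for .NET's `char` and `string` (the values in the comments assume the standard UTF-16 representation described above):

```csharp
using System;

class CodeUnitsDemo
{
    static void Main()
    {
        // Every .NET char is a single UTF-16 code unit: always 2 bytes.
        Console.WriteLine(sizeof(char));                         // 2

        string bmp = "\u00E9";                                   // é, inside the BMP
        Console.WriteLine(bmp.Length);                           // 1 code unit

        string astral = char.ConvertFromUtf32(0x1F600);          // 😀 (U+1F600), outside the BMP
        Console.WriteLine(astral.Length);                        // 2 code units (a surrogate pair)
        Console.WriteLine(char.IsHighSurrogate(astral[0]));      // True
        Console.WriteLine(char.ConvertToUtf32(astral, 0).ToString("X"));  // 1F600
    }
}
```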

Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!

Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.

The two disadvantages of UTF-16 are:

  • The number of code units per Unicode character is variable, because not all characters are in the BMP. Until emoji became popular, this didn't affect many apps in day-to-day use. These days, certainly for messaging apps and the like, developers using UTF-16 really need to know about surrogate pairs (see the sketch after this list).
  • For plain ASCII (which a lot of text is, at least in the west) it takes twice the space of the equivalent UTF-8 encoded text.
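
Here is a sketch of the surrogate-pair awareness that first bullet refers to, using long-standing framework methods; the sample string is just an assumption:

```csharp
using System;

class CodePointWalk
{
    static void Main()
    {
        string s = "a" + char.ConvertFromUtf32(0x1F600) + "b";   // 'a', an emoji, 'b'

        Console.WriteLine(s.Length);   // 4 UTF-16 code units, but only 3 Unicode code points

        // Step through code points, advancing two chars across each surrogate pair.
        for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
        {
            int codePoint = char.ConvertToUtf32(s, i);
            Console.WriteLine($"U+{codePoint:X4}");
        }
    }
}
```

On newer runtimes (.NET Core 3.0 onwards), `s.EnumerateRunes()` or `System.Globalization.StringInfo` give a similar code-point or text-element view without the manual loop.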

(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)

Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
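
To illustrate the "getting to the nth character" caveat at the end: even with a fixed-width view of code points, what a user perceives as one character can still be several code points until you normalize (the strings here are only an example):

```csharp
using System;
using System.Text;

class NormalizationDemo
{
    static void Main()
    {
        string composed   = "\u00E9";     // é as one precomposed code point
        string decomposed = "e\u0301";    // 'e' followed by a combining acute accent

        Console.WriteLine(composed.Length);         // 1
        Console.WriteLine(decomposed.Length);       // 2
        Console.WriteLine(composed == decomposed);  // False: ordinal comparison sees different code units

        // Normalization Form C recombines the pair into the single precomposed code point.
        string recomposed = decomposed.Normalize(NormalizationForm.FormC);
        Console.WriteLine(recomposed == composed);  // True
        Console.WriteLine(recomposed.Length);       // 1
    }
}
```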

Jon Skeet
  • 1,261,211
  • 792
  • 8,724
  • 8,929
  • 2
    the point of UTF-8 is that, if you need 6 bytes per character to truly represent all possibilities, then anything less than UTF-32 is a problem that needs special cases and extra code. So UTF-16 and UTF-8 are both imperfect. However, as UTF-8 is half the size, you might as well use that. You gain nothing by using UTF-16 over it (except increased file/string sizes). Of course, some people will use UTF-16 and ignorantly assume it handles all characters. – gbjbaanb Feb 18 '13 at 17:47
  • Can you please elaborate on _"UTF-8 in the string representation, converting as we go"_? Both of them (UTF-8, UTF-16) have variable width... – Royi Namir Feb 18 '13 at 18:26
  • 2
    I've read it 14 times and I still don't understand this line: _the size per code unit being constant_. AFAIK the size can be 2, 3, or 4 bytes (in UTF-16), so what is constant here? – Royi Namir Feb 18 '13 at 20:08
  • I think he means UCS-16, which is what Windows calls "Unicode" - i.e. a fixed 2-byte-per-character encoding. Back in the day, we thought this was enough to store all character encodings. We were wrong, hence UTF-8 being an internet "standard" now. – gbjbaanb Feb 19 '13 at 12:57
  • 1
    @gbjbaanb: No, .NET uses UTF-16. So when anything outside the BMP is required, surrogate pairs are used. Each character is a UTF-16 code unit. (As far as I'm aware there's no such thing as UCS-16 either - I think you mean UCS-2.) – Jon Skeet Feb 19 '13 at 13:19
  • 11
    @RoyiNamir: No, the size of a UTF-16 code unit is *always* 2 bytes. A Unicode character takes either one code unit (for the Basic Multilingual plane) or two code units (for characters U+10000 and above). – Jon Skeet Feb 19 '13 at 13:20
  • Phewww... thanks Jon. I thought you forgot me... Now it is clear. Again, thanks a lot. – Royi Namir Feb 19 '13 at 13:44
  • But Jon, looking at the table in my question, let's say the "hello world" was saved in an XML file which was saved with UTF-8 encoding. Later, I open the file in VS and I do(!) see the XML with the "hello world" -- so, the Visual Studio editor knew(!) how to open the file. It knew how to decode the bytes on the hard drive. So --- why do I still need to declare the charset encoding tag at the top of the XML? – Royi Namir Jan 24 '14 at 16:48
  • @RoyiNamir: Specifically for XML, you don't *need* to have an encoding tag if you're using UTF-8 or UTF-16. The specification explains how the encoding can be inferred from the first few characters. For any other encoding, you *must* include the encoding. (Note that your question doesn't mention XML anywhere...) – Jon Skeet Jan 24 '14 at 17:06
  • Jon, I'm sorry, but why are other encodings different? I mean, if my program reads a remote file as bytes, and the encoding can be inferred from the first bytes, then why mention it again in the tag? It seems weird to me that I can open a safe via a clue (the first bytes), and when I open it, I see another clue inside (the charset tag)... I must be missing something here. :-( – Royi Namir Jan 24 '14 at 17:53
  • @RoyiNamir: It can only be inferred between UTF-8 and UTF-16, and those are the only ones the XML specification dictates are okay to leave out. And that's just for XML - for other text files, editors are left to guess heuristically, and can get it wrong. – Jon Skeet Jan 24 '14 at 17:55
  • @JonSkeet: But if they can get it wrong (and let's say they _are_ getting it wrong) -- how would they know how to "read properly" the encoding declared in the charset tag? I didn't find the answer for this in any of the answers here. Thank you. – Royi Namir Jan 25 '14 at 08:53
  • @RoyiNamir: To be honest, this should all be as a new question, given that it's XML-specific. But I believe there's an assumption that the other encodings will be compatible with ASCII, but I suspect that XML parsers which support non-ASCII-compatible encodings can try those as well. See http://www.w3.org/TR/xml/#charencoding for more information. – Jon Skeet Jan 25 '14 at 08:58
  • @JonSkeet: what am I missing? as far as I know, UTF-16 can be 2 or 4 bytes. All internet resources show the same thing: [Wiki](https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings) or [Unicode.org](http://unicode.org/faq/utf_bom.html) – Guy Sep 21 '15 at 12:25
  • @gMorphus: A UTF-16 code unit is always 2 bytes. A Unicode code point is represented by one or two UTF-16 code units. I'm not sure which part of either the answer or the comments you're disagreeing with. – Jon Skeet Sep 21 '15 at 12:31
  • @JonSkeet "It could be up to 6 bytes". In the Thompson-Pike UTF-8 proposal (Ken Thompson and Rob Pike) the possible range of characters was [0, 7FFFFFFF], requiring up to 6 bytes (or octets: 8-bit bytes). In 2003, the range of characters was restricted to [0, 10FFFF] (the UTF-16 accessible range). See: https://tools.ietf.org/html/rfc3629 So, all characters are encoded using sequences of 1 to 4 octets. Not 6. – Fernando Pelliccioni Jul 28 '16 at 22:58
  • @JonSkeet, RoyiNamir said: "(utf8,16) has variable width". I understand he means that UTF-16 not a variable-width encoding, and he is right. You answered: "the size of a UTF-16 code unit is always 2 bytes". And..., the size of a UTF-8 code unit is always 1 byte. – Fernando Pelliccioni Jul 28 '16 at 23:08
  • 1
    @FernandoPelliccioni: How do you define "variable-width encoding" precisely? Having just reread definitions, I agree I was confused about the precise meaning of "code unit" but both UTF-8 and UTF-16 are variable width in terms of "they can take a variable number of bytes to represent a single Unicode code point". For UTF-8 it's 1-4 bytes, for UTF-16 it's 2 or 4. Will check over the rest of my answer for precision now. – Jon Skeet Jul 29 '16 at 06:04
  • @FernandoPelliccioni: I've fixed the "up to 6 bytes part" btw. – Jon Skeet Jul 29 '16 at 06:10
  • 1
    @FernandoPelliccioni: Thanks for the prod to revisit this, btw - and always nice to get more precise about terms – Jon Skeet Jul 29 '16 at 06:18
  • Thank you @JonSkeet. You're always so kind to help others with your knowledge. Here are some references I consider a good read. (Beyond the standards) http://utf8everywhere.org/ http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful – Fernando Pelliccioni Jul 29 '16 at 09:59
  • @JonSkeet Here another reference of the meaning of "variable-width encoding" by the Unicode guys. http://www.unicode.org/versions/Unicode8.0.0/UnicodeStandard-8.0.pdf (see pages [36-39]) – Fernando Pelliccioni Jul 29 '16 at 10:59
  • @FernandoPelliccioni: Right, that concurs that UTF-16 is variable-width. "The distinction between characters represented with one versus two 16-bit code units means that formally UTF-16 is a variable-width encoding form." – Jon Skeet Jul 29 '16 at 11:09
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/118641/discussion-between-fernando-pelliccioni-and-jon-skeet). – Fernando Pelliccioni Jul 29 '16 at 13:41
  • Does .Net actually use UTF-16 internally though? From what I've seen, `Char` is just a 16-bit struct, and `String` is just an array of `Char`. There's no variable width, it's a plain dump of unicode code points that can't go above 0xFFFF. – Nyerguds Feb 22 '21 at 11:10
  • 1
    @Nyerguds: `char` is a 16-bit struct, yes. But it uses surrogate pairs, so the number of Unicode *code points* in a string is not just the number of chars, unless you count each half as its own code point. – Jon Skeet Feb 22 '21 at 11:29
  • @JonSkeet Interesting, thx. – Nyerguds Feb 22 '21 at 11:35
33

As with many "why was this chosen" questions, this was determined by history. Windows became a Unicode operating system at its core in 1993. Back then, Unicode still only had a code space of 65,536 codepoints, an encoding these days called UCS-2. It wasn't until 1996 that Unicode acquired the supplementary planes, extending the code space to more than a million codepoints, and surrogate pairs to fit them into a 16-bit encoding, thus setting the UTF-16 standard.

.NET strings are UTF-16 because that's an excellent fit with the operating system's encoding; no conversion is required.

The history of UTF-8 is murkier. It definitely came later than Windows NT; RFC 3629, which standardized it, dates from November 2003. It took a while to gain a foothold; the Internet was instrumental.

Hans Passant
  • 873,011
  • 131
  • 1,552
  • 2,371
11

UTF-8 is the default for text storage and transfer because it is a relatively compact form for most languages (some, such as the CJK scripts, are more compact in UTF-16 than in UTF-8). Each specific language also has an even more efficient encoding of its own.

UTF-16 is used for in-memory strings because it is faster per character to parse and maps directly to the Unicode character class and other tables. All string functions in Windows use UTF-16 and have for years.
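
As a rough sketch of the kind of per-code-unit lookup this enables from C# (the sample text is an assumption; this illustrates the .NET side rather than Windows internals):

```csharp
using System;
using System.Globalization;

class CategoryLookup
{
    static void Main()
    {
        string s = "A1 \u00A3";   // letter, digit, space, currency symbol (£)

        for (int i = 0; i < s.Length; i++)
        {
            // Indexing is O(1): every element of the string is a fixed-size UTF-16 code unit,
            // and the classification below works directly on that value with no decoding step.
            UnicodeCategory category = char.GetUnicodeCategory(s[i]);
            Console.WriteLine($"{s[i]} -> {category}");
        }
    }
}
```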

Remy Lebeau
  • 454,445
  • 28
  • 366
  • 620
user2457603
  • 121
  • 1
  • 4