I'm no expert on encodings, but here's what I think I know (though it may be wrong):

  1. ASCII is a 7-bit, fixed-length encoding, with the characters you can find in ASCII charts.
  2. UTF-8 is an 8-bit, variable-length encoding. All Unicode characters can be written in UTF-8.
  3. UCS-2 LE/BE are fixed-length, 16-bit encodings that support most common characters.
  4. UTF-16 is a 16-bit, variable-length encoding. All Unicode characters can be written in UTF-16.

Are those above all correct?

Now, for the questions:

  1. Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?
  2. Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.
  3. In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?
  4. Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)
  5. Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?
  6. Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?
  7. What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?
  8. (I had more questions, but this is enough... I forgot some of them anyway...)

These are a lot of questions, so any links to explanations about how all these connect (aside from reading the Unicode standard, which won't help with the Windows API anyway) would also be greatly appreciated.

Thank you!

user541686
  • Why won't the Unicode standard help with Windows? My preferred reference, for what it's worth, is the O'Reilly book: http://oreilly.com/catalog/9780596101213/ – David Heffernan Jan 04 '11 at 12:15
  • @David: Because it can't answer questions about A vs W functions. But thanks for the reference to the book, it seems interesting. – user541686 Jan 07 '11 at 16:42
  • It's a good book. Knowing more general background on Unicode does help understanding the specifics and in particular you'll have a clearer idea as to why the Windows API is the way it is. – David Heffernan Jan 07 '11 at 16:43

4 Answers

29

Are those above all correct?

Yes, if you don't assume the existence of characters not encoded in Unicode (for most practical applications, this assumption is fine).
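To make the fixed- versus variable-length distinction concrete, here is a minimal sketch in Python (chosen only because it runs anywhere; the codec names are Python's, not Windows identifiers, and the helper names are made up for this illustration). It counts how many bytes UTF-8 and UTF-16 need for a few sample characters:

```python
# Sample characters: ASCII, Latin-1 range, elsewhere in the BMP, and non-BMP.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def encoded_lengths(ch):
    """Return the byte length of one character in UTF-8 and UTF-16."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" variant is used so that no 2-byte BOM is prepended.
        "utf-16": len(ch.encode("utf-16-le")),
    }


def is_ascii(ch):
    """ASCII only covers code points 0 through 127."""
    return ord(ch) < 128


if __name__ == "__main__":
    for ch in SAMPLES:
        kind = "ASCII" if is_ascii(ch) else "not ASCII"
        print(f"U+{ord(ch):04X}", encoded_lengths(ch), kind)
```

UTF-8 goes from 1 to 4 bytes across these samples, while UTF-16 uses 2 bytes for everything in the BMP and 4 bytes for the non-BMP character, which is exactly the case UCS-2 cannot represent.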

Do the Windows "A" functions (like SetWindowTextA) take in ASCII strings? Or "multi-byte strings" (more questions on this below)?

They take byte strings (i.e., strings whose code unit is a byte, which is always an octet on Windows) encoded in the current "ANSI"/MBCS/legacy encoding. "ANSI" is the historical term for these encodings, but it is a misnomer. For Western Windows systems, this encoding is usually Windows-1252.
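As a rough illustration of why the active code page matters, the sketch below decodes the same byte under two legacy code pages. It uses Python's `cp1252` and `cp1251` codecs as stand-ins for the Windows code pages; it does not call any "A" function, and `decode_with_code_page` is a hypothetical helper.

```python
def decode_with_code_page(raw, code_page):
    """Interpret a byte string under the given legacy code page."""
    return raw.decode(code_page)


if __name__ == "__main__":
    raw = b"\xe9"  # one byte, with no encoding information attached
    for code_page in ("cp1252", "cp1251"):
        text = decode_with_code_page(raw, code_page)
        print(code_page, f"U+{ord(text):04X}")
```

The byte string itself carries no hint of which interpretation is intended, so the same bytes passed to an "A" function can mean different text on differently configured systems.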

Do the Windows "W" functions take in UTF-16 strings or UCS-2 strings? I thought they take in UCS-2, but the names confuse me.

Since Windows 2000, most of them support UTF-16. The name "wide" and the rest of the Microsoft terminology (e.g., "Unicode" meaning "UTF-16" or "UCS") were chosen before the modern Unicode standard unified the terminology.
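The practical difference between UCS-2 and UTF-16 is the surrogate-pair mechanism. A minimal sketch of the arithmetic, following the UTF-16 definition (the function name is made up for this example):

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # 20 significant bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x1F600)
    print(f"{high:04X} {low:04X}")
```

A UCS-2-only consumer would see these as two separate 16-bit values with no assigned character, whereas a UTF-16-aware one combines them back into a single code point.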

In WideCharToMultiByte, Microsoft uses the word "wide-character string" to mean UTF-16. In that context, then what is considered a "multi-byte string"? UTF-8?

Every other encoding that WideCharToMultiByte supports is a "multi-byte encoding" in this context, including Windows-1251 and UTF-8.
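For a feel of what "wide to multi-byte" means, here is a loose Python analogue. It is only an analogy: `wide_to_multibyte` is a hypothetical helper, not the Windows function, and the real WideCharToMultiByte has flags and default-character rules that this sketch does not model.

```python
def wide_to_multibyte(wide_bytes, code_page):
    """Convert UTF-16LE bytes to a target encoding.

    Characters the target cannot represent are replaced with '?'.
    """
    text = wide_bytes.decode("utf-16-le")
    return text.encode(code_page, errors="replace")


if __name__ == "__main__":
    wide = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("utf-16-le")  # Russian "Privet"
    for code_page in ("cp1251", "utf-8", "cp1252"):
        print(code_page, wide_to_multibyte(wide, code_page))
```

Note that the "multi-byte" result is a single byte per character for Windows-1251 and two bytes per character for UTF-8 here, so "multi-byte" really just means "whatever byte-oriented encoding was requested".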

Is LPWSTR a "wide-character string"? I would say it is, but then, wouldn't that mean it's UTF-16? And wouldn't that mean that it could be used to display, say, 4-byte characters? If not, then... is displaying 4-byte characters impossible? (Windows doesn't seem to have APIs for those.)

LPWSTR is a pointer to wchar_t, which is always a 16-bit unsigned integer on Windows. UTF-16 encodes each non-BMP character as a pair of 16-bit code units (a surrogate pair), so such a string can carry those characters. Which characters can be displayed is unrelated to the encoding as long as that encoding can encode all Unicode characters. Windows is generally able to display non-BMP characters, but not everywhere (e.g., the console cannot).

Is the functionality of WideCharToMultiByte a superset of that of wcstombs, and do they both work on the same type of string? Or does one, say, work on UTF-16 while the other works on UCS-2?

I don't really know, but I don't think they differ much. I suppose you could just try converting some non-BMP character to UTF-8 with each and check whether the result is correct.

Are file paths in UTF-16 or UCS-2? I know Windows treats it as an "opaque array of characters" from Microsoft's documentation, but per the C standard for functions like fwprintf, is there any standardized encoding?

File paths are indeed opaque arrays of UTF-16 code units, meaning that Windows doesn't perform any kind of translation when storing or reading file names (like Linux and unlike Mac OS X). But Windows still has its weird, mostly undefined case-insensitive behavior, which causes much trouble because file names that are treated as equivalent aren't necessarily equal. That breaks many invariants; for example, on Linux without interference from other threads, if you successfully create two files A and a in some directory, you'll end up with two distinct files, while on Windows you get only one file (and in general, an unpredictable number of files).

What is "ANSI" encoding? Is that even a correct term? And how does it relate to ASCII?

ANSI is the American standardization organization. Using this word when referring to encodings is a misnomer, but a frequent one, so you should be aware of it. I prefer the term legacy 8-bit encoding, because I think that's essentially what it is: a non-Unicode encoding that is kept only for compatibility with legacy (Windows 9x) applications. On Western systems, this is usually Windows-1252, which is a proper superset of ASCII.
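The superset claim can be checked mechanically. The sketch below compares how a code page decodes the 128 ASCII byte values against plain ASCII, using Python's codec tables as a stand-in for the Windows ones (`agrees_with_ascii` is a name invented for this example):

```python
def agrees_with_ascii(code_page):
    """Return True if bytes 0x00..0x7F decode identically to ASCII."""
    ascii_bytes = bytes(range(128))
    return ascii_bytes.decode(code_page) == ascii_bytes.decode("ascii")


if __name__ == "__main__":
    # cp037 is an EBCDIC code page, included as a counterexample.
    for code_page in ("cp1252", "cp1251", "utf-8", "cp037"):
        print(code_page, agrees_with_ascii(code_page))
```

This is why pure-ASCII text usually survives a trip through the "A" functions regardless of the active code page, while anything above 0x7F depends on it.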

Philipp
  • The case-sensitivity is a property of the filesystem. In NTFS, it is defined by a lower-to-upper case map stored in a hidden file which is created when the filesystem is formatted. Hence it can vary (slightly) depending on what locale the filesystem was formatted in. – Ben Jul 11 '14 at 09:57
7
  1. *A functions use the active ANSI codepage.

  2. *W functions use UTF-16.

  3. Multi-byte refers to whatever is passed in the CodePage parameter. It is most commonly either the active ANSI codepage or UTF-8.

  4. LPWSTR is a UTF-16 string which may or may not be null-terminated (see MSDN).

  5. I don't know anything about wcstombs, I always use WideCharToMultiByte.

  6. File paths are in UTF-16. In fact all text is UTF-16 internally in Windows.

  7. For ANSI encoding you will need to read up on that in some detail. You could do worse than to start with Wikipedia and follow the links from there.

I hope that helps. If I've got anything wrong, anyone who knows more, please do edit this to correct any errors!

David Heffernan
6

Wide strings used to be UCS-2. From Windows 2000, wide strings are UTF-16. Good to know if you need to maintain some old legacy system.

Jörgen Sigvardsson
1

First of all you'll find plenty of information in this SO topic.

ASCII is a charset, not an encoding. Now, there are a number of 8-bit charsets, one of which is set as the system default (you can change it in Regional Settings). *A functions accept 8-bit characters in that charset. UTF-8 is not a charset, but an encoding of the Unicode charset. *W functions, as I understand, use UTF-16 rather than UCS-2.

Eugene Mayevski 'Callback
  • Thank you so much for the link! Something that confuses me, though: If the *W functions are UTF-16, then how come Microsoft says, ["the file system treats path and file names as an opaque sequence of WCHARs"](http://msdn.microsoft.com/en-us/library/aa365247%28v=vs.85%29.aspx)? – user541686 Jan 04 '11 at 10:17
  • @Lambert and what's the problem with this statement? It means that Windows doesn't perform any interpretation of the passed file name, i.e. if surrogate characters are included, Windows doesn't care about them. I think specialists in Unicode will be able to explain more. – Eugene Mayevski 'Callback Jan 04 '11 at 10:34
  • That's not really the problem -- the problem is, that means that you can pass in invalid non-Unicode data, and it would still work. Is that correct? – user541686 Jan 04 '11 at 10:42
  • @Lambert yes, sort of. Windows will accept anything except \0 and forbidden characters (slashes, quotes, question marks, and asterisks). This is exactly what they are saying there: Windows doesn't care about the validity of the Unicode characters passed. – Eugene Mayevski 'Callback Jan 04 '11 at 10:55
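To illustrate what "invalid" means in this exchange: a sequence of 16-bit values is only well-formed UTF-16 if every high surrogate is immediately followed by a low surrogate and no low surrogate stands alone. A minimal sketch of such a check (the function name is invented here; nothing in the thread suggests Windows runs a check like this on file names, which is the point being made):

```python
def is_well_formed_utf16(units):
    """Return True if a sequence of 16-bit code units is valid UTF-16."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return False
        else:
            i += 1
    return True


if __name__ == "__main__":
    print(is_well_formed_utf16([0x0041, 0xD83D, 0xDE00]))  # paired surrogates
    print(is_well_formed_utf16([0x0041, 0xD83D]))          # lone high surrogate
```

Code that reads file names and converts them to another encoding may need a check of this kind, because a name that the file system accepted is not guaranteed to pass it.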