Since I was asked to, I’ll do some necromancy. The other answers were from 2009, but this question still came up on a search I did in 2018. The situation today is very different. Also, the accepted answer was incomplete even back in 2009.
The Source Character Set
Every compiler (including Microsoft’s Visual Studio 2008 and later, gcc, clang and icc) will read UTF-8 source files that start with a BOM without a problem, and clang will not read anything but UTF-8, so UTF-8 with a BOM is the lowest common denominator for C and C++ source files.
The language standard doesn’t say what source character sets the compiler needs to support. Some real-world source files are even saved in a character set incompatible with ASCII. Microsoft Visual C++ in 2008 supported UTF-8 source files with a byte order mark, as well as both forms of UTF-16. Without a byte order mark, it would assume the file was encoded in the current 8-bit code page, which was always a superset of ASCII.
In 2012, the compiler added a /utf-8 switch to CL.EXE. Today, it also supports the /source-charset and /execution-charset switches, as well as /validate-charset to detect if your file is not actually UTF-8. This page on MSDN has a link to the documentation on Unicode support for every version of Visual C++.
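For example, either of the following command lines (main.cpp is just a placeholder name) compiles a UTF-8 source file; as I understand it, the /utf-8 switch is shorthand for setting both character sets at once:

cl /utf-8 main.cpp
cl /source-charset:utf-8 /execution-charset:utf-8 /validate-charset main.cpp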
The Execution Character Sets
Current versions of the C++ standard say the compiler must have both an execution character set, which determines the numeric value of character constants like 'a', and an execution wide-character set that determines the value of wide-character constants like L'é'.
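Purely by way of illustration (the standard guarantees none of these numeric values), on a typical ASCII-compatible build with a Unicode wide-character set you would get:

char narrow = 'a';   // 97 in any ASCII-compatible execution character set
wchar_t wide = L'é'; // 0xE9 when the wide execution character set is Unicode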
To language-lawyer for a bit, there are very few requirements in the standard for how these must be encoded, and yet Visual C and C++ manage to break them. The basic execution character set must contain about 100 characters that cannot have negative values, and the encodings of the digits '0' through '9' must be consecutive. Neither capital nor lowercase letters have to be, because they weren’t on some old mainframes. (That is, '0'+9 must be the same as '9', but there is still a compiler in real-world use today whose default behavior is that 'a'+9 is not 'j' but '«', and this is legal.) The wide-character execution set must include the basic execution set and have enough bits to hold all the characters of any supported locale. Every mainstream compiler supports at least one Unicode locale and understands valid Unicode characters specified with \Uxxxxxxxx, but a compiler that didn’t could claim to be complying with the standard.
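To put that guarantee in code: the first assertion below must compile everywhere, while the commented-out one is not required to hold, and in fact fails under an EBCDIC execution character set like the one that compiler uses by default:

static_assert('0' + 9 == '9', "the standard requires the digits to be consecutive");
// static_assert('a' + 9 == 'j', "NOT required: letters need not be consecutive");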
The way Visual C and C++ violate the language standard is by making their wchar_t UTF-16, which can only represent some characters as surrogate pairs, when the standard says wchar_t must be a fixed-width encoding. This is because Microsoft defined wchar_t as 16 bits wide back in the 1990s, before the Unicode committee figured out that 16 bits were not going to be enough for the entire world, and Microsoft was not going to break the Windows API. It does support the standard char32_t type as well.
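Here is a small sketch of the difference, using a character outside the Basic Multilingual Plane (U+1F600) as an example:

constexpr char32_t grin[] = U"\U0001F600";  // always one code unit plus the terminator
constexpr wchar_t grin_w[] = L"\U0001F600"; // two UTF-16 code units plus the terminator on Windows
static_assert(sizeof(grin) / sizeof(char32_t) == 2, "char32_t is a fixed-width encoding");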
UTF-8 String Literals
The third issue this question raises is how to get the compiler to encode a string literal as UTF-8 in memory. You’ve been able to write something like this since C++11:
constexpr unsigned char hola_utf8[] = u8"¡Hola, mundo!";
This will encode the string as its null-terminated UTF-8 byte representation regardless of whether the source character set is UTF-8, UTF-16, Latin-1, CP1252, or even IBM EBCDIC 1047 (which is a silly theoretical example but still, for backward-compatibility, the default on IBM’s Z-series mainframe compiler). That is, it’s equivalent to initializing the array with { 0xC2, 0xA1, 'H', /* ... , */ '!', 0 }.
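If you want to convince yourself of that, a quick sanity check against the constexpr array declared above is:

static_assert(sizeof(hola_utf8) == 15, "two bytes for the inverted exclamation mark, twelve ASCII bytes, one terminator");
static_assert(hola_utf8[0] == 0xC2 && hola_utf8[1] == 0xA1, "the UTF-8 encoding of U+00A1");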
If it would be too inconvenient to type a character in, or if you want to distinguish between superficially-identical characters such as space and non-breaking space or precomposed and combining characters, you also have universal character escapes:
constexpr unsigned char hola_utf8[] = u8"\u00a1Hola, mundo!";
You can use these regardless of the source character set and regardless of whether you’re storing the literal as UTF-8, UTF-16 or UCS-4. They were originally added in C99, but Microsoft did not support them until Visual Studio 2015.
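For example, these two literals look identical in most editors, but the second uses U+00A0, the non-breaking space, and therefore produces different bytes:

constexpr unsigned char with_space[] = u8"a b";      // 0x61 0x20 0x62 0x00
constexpr unsigned char with_nbsp[]  = u8"a\u00a0b"; // 0x61 0xC2 0xA0 0x62 0x00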
Edit: As reported by Matthew, u8 string literals are buggy in some versions of MSVC, including 19.14. It turns out that literal non-ASCII characters are, too, even if you specify /utf-8 or /source-charset:utf-8 /execution-charset:utf-8. The sample code above works properly in 19.22.27905.
There is another way to do this that worked in Visual C or C++ 2008, however: octal and hexadecimal escape codes. You would have encoded UTF-8 literals in that version of the compiler with:
const unsigned char hola_utf8[] = "\xC2\xA1Hola, mundo!";