17

Not being able to wrap my head around this one is a real source of shame...

I'm working with a French version of Visual Studio (2008), on a French Windows (XP). Accented French characters placed in strings sent to the output window get corrupted. Ditto for input read back from the output window. It's a typical character encoding issue: I enter ANSI, get UTF-8 in return, or something to that effect. What setting can ensure that the characters remain in ANSI when showing a "hardcoded" string in the output window?

EDIT:

Example:

#include <iostream>

int main()
{
    std::cout << "àéêù" << std::endl;

    return 0;
}

Will show in the output:

óúÛ¨

(here encoded as HTML for your viewing pleasure)

I would really like it to show:

àéêù

MPelletier
  • Can you give us a little bit more input. Is this happening for build output, all output or something else? Can you give us a specific operation for which this happens (build, debugging, etc ...) – JaredPar Dec 07 '09 at 03:50
  • Yes, please show an example of what you think should appear and what actually appears. – wallyk Dec 07 '09 at 03:52
  • What happens if you use wcout? – Naveen Dec 07 '09 at 04:00
  • It's pretty much in all output. Debug, build, watcher, etc. – MPelletier Dec 07 '09 at 04:03
  • @Naveen: no dice, doesn't change anything. – MPelletier Dec 07 '09 at 04:05
  • I am not sure, but I think you should use _T() or L"" to specify Unicode strings in Visual Studio. Can you try that once with wcout? – Naveen Dec 07 '09 at 04:10
  • @Naveen: I tried `L"àéêù"`, to no avail. Not sure how _T() works though... Can you give an actual example please? – MPelletier Dec 07 '09 at 04:19
  • It is used as _T("naveen"), but I don't expect it to work since L"" is not working. Maybe some other issue... – Naveen Dec 07 '09 at 04:26
  • 1
    Why aren't you using wide strings? That's how Windows implements Unicode support – jalf Dec 07 '09 at 11:32
  • All characters of the French language are supported without Unicode as Extended Ascii. That's why I don't use Unicode, I shouldn't need it. – MPelletier Dec 09 '09 at 01:41
  • Windows does not understand "Extended Ascii". It only understands locale specific codepages (likely defaults to 1252 for your machine) and unicode. – Bahbar Dec 17 '09 at 08:38
  • Since this answer came up in a Google search in 2018, I’ll leave a comment. C++11 or later supports UTF-8 for the execution character set with `u8"..."`. Visual C++ 2008 supported UTF-8 with a BOM as the source character set, and current versions support it without a BOM with the `/UTF-8` switch. Other compilers, including gcc, clang and icc, support it too. The language standard has, for years, allowed compilers to support any source and execution character sets they want so long as they contain a minimal set of basic characters. – Davislor Apr 23 '18 at 16:01
  • @Davislor Thank you for this! Could you write this as an answer? It's certainly pertinent and that would bring it attention. – MPelletier Apr 23 '18 at 17:16
  • @MPelletier Done. Maybe a little overdone. – Davislor Apr 23 '18 at 20:49

8 Answers

15

Before I go any further, I should mention that what you are doing is not C/C++ compliant. The specification states in §2.2 which character sets are valid in source code. There ain't much in there, and all the characters used are ASCII. So... everything below is about a specific implementation (as it happens, VC2008 on a US-locale machine).

To start with, you have 4 chars on your cout line and 4 glyphs in the output. So the issue is not one of UTF-8 encoding, as that would combine multiple source chars into fewer glyphs.

From your source string to the display on the console, all of these things play a part:

  1. What encoding your source file is in (i.e. how your C++ file will be seen by the compiler)
  2. What your compiler does with a string literal, and what source encoding it understands
  3. How your << interprets the encoded string you're passing in
  4. What encoding the console expects
  5. How the console translates that output to a font glyph.

Now...

1 and 2 are fairly easy ones. It looks like the compiler guesses what format the source file is in and decodes it to its internal representation. It generates the data chunk corresponding to the string literal in the current codepage, no matter what the source encoding was. I have failed to find explicit details/control over this.

3 is even easier. Except for control codes, << just passes the data down for char *.

4 is controlled by SetConsoleOutputCP. It should default to your default system codepage. You can also figure out which one you have with GetConsoleOutputCP (the input is controlled differently, through SetConsoleCP)

5 is a funny one. I banged my head to figure out why I could not get the é to show up properly, using CP1252 (western european, windows). It turns out that my system font does not have the glyph for that character, and helpfully uses the glyph of my standard codepage (capital Theta, the same I would get if I did not call SetConsoleOutputCP). To fix it, I had to change the font I use on consoles to Lucida Console (a true type font).
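
To make points 4 and 5 concrete, here is a minimal sketch (my own, not part of the original answer) that queries the current console output code page, switches it to 1252, and prints the accented characters spelled out as explicit CP1252 bytes. It assumes a Western European Windows system where code page 1252 is installed and a console font (such as Lucida Console) that has the glyphs:

#include <windows.h>
#include <iostream>

int main()
{
    // Report which code page the console currently uses for output.
    std::cout << "Console output code page: " << GetConsoleOutputCP() << "\n";

    // SetConsoleOutputCP returns zero on failure (e.g. the code page is not installed).
    if (!SetConsoleOutputCP(1252))
        std::cerr << "Could not switch the console to code page 1252\n";

    // "àéêù" written as explicit CP1252 byte values, so the result does not
    // depend on how the compiler interprets this source file.
    std::cout << "\xE0\xE9\xEA\xF9" << std::endl;
    return 0;
}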

Some interesting things I learned looking at this:

  • the encoding of the source does not matter, as long as the compiler can figure it out (notably, changing it to UTF-8 did not change the generated code; my "é" string was still encoded with CP1252 as 233 0)
  • VC is picking a codepage for the string literals that I do not seem to control.
  • controlling what the console shows is more painful than I was expecting

So... what does this mean to you? Here are bits of advice:

  • don't use non-ASCII in string literals. Use resources, where you control the encoding.
  • make sure you know what encoding is expected by your console, and that your font has the glyphs to represent the chars you send.
  • if you want to figure out what encoding is being used in your case, I'd advise printing the actual value of the character as an integer. char * a = "é"; std::cout << (unsigned int) (unsigned char) a[0] does show 233 for me, which happens to be the encoding in CP1252. (A self-contained version of this check appears below.)

BTW, if what you got was "ÓÚÛ¨" rather than what you pasted, then it looks like your 4 bytes are interpreted somewhere as CP850.
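
Here is a self-contained version of that byte-dump check from the last bullet above (my own sketch, not part of the original answer). Under CP1252 it prints 224 233 234 249 for "àéêù"; eight values would instead point to a UTF-8 encoded literal:

#include <iostream>

int main()
{
    const char* a = "àéêù";

    // Print each byte of the string literal as an unsigned integer so you can
    // see exactly what encoding the compiler chose for it.
    for (const char* p = a; *p != '\0'; ++p)
        std::cout << static_cast<unsigned int>(static_cast<unsigned char>(*p)) << ' ';
    std::cout << std::endl;

    return 0;
}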

Bahbar
  • Using resources.. Definitely gotta look into that. Here's where it gets tougher though: The console acts as a filter of sorts, because if I "cin>>" some accented letters, lo and behold, funny characters are gotten on the other side! I'm not at that machine at the moment, but I will try to reoutput what I get from cin and see if it gets garbled further or reverts back. – MPelletier Dec 08 '09 at 14:13
  • Excellent answer. I shall certainly make a note of this. – Charles Anderson Dec 16 '09 at 14:58
  • This answer is quite useful to understand what happens to the raw bytes of the source code file for a string literal through the process of compilation and through to the runtime system. Perhaps you might have a look at http://stackoverflow.com/questions/27871124/does-the-multibyte-to-wide-string-conversion-function-mbstowcs-when-passed-a? – Dan Nissenbaum Jan 10 '15 at 02:41
  • Specifically - perhaps, if you have time, you could address how the compiler's `internal representation` of the bytes of the raw string literal (that you mention in your answer) corresponds to the C++ standard's `execution character set`? Thanks! – Dan Nissenbaum Jan 10 '15 at 02:44
  • @DanNissenbaum: Well, as it happens, you seem to have delved way deeper in the possibilities than I actually know. FYI, I didn't know anything about the subject before I typed this answer. I was just curious :). But I'd stick to my first advice: don't use non-ascii in string literals - not just because you don't control the encoding, but also because if it's not ascii, chances are it's something you'll want to localize in the future – Bahbar Jan 14 '15 at 11:04
6

Try this:

#include <iostream>
#include <locale>

int main()
{
    std::locale::global(std::locale(""));
    std::cout << "àéêù" << std::endl;

    return 0;
}
ruf
  • Nice, but this seems to only work for output; the input received from the console is still random gibberish. – KáGé Nov 23 '13 at 18:42
5

Because I was requested to, I’ll do some necromancy. The other answers were from 2009, but this article still came up on a search I did in 2018. The situation today is very different. Also, the accepted answer was incomplete even back in 2009.

The Source Character Set

Every compiler (including Microsoft’s Visual Studio 2008 and later, gcc, clang and icc) will read UTF-8 source files that start with BOM without a problem, and clang will not read anything but UTF-8, so UTF-8 with a BOM is the lowest common denominator for C and C++ source files.

The language standard doesn’t say what source character sets the compiler needs to support. Some real-world source files are even saved in a character set incompatible with ASCII. Microsoft Visual C++ in 2008 supported UTF-8 source files with a byte order mark, as well as both forms of UTF-16. Without a byte order mark, it would assume the file was encoded in the current 8-bit code page, which was always a superset of ASCII.

The Execution Character Sets

In 2012, the compiler added a /utf-8 switch to CL.EXE. Today, it also supports the /source-charset and /execution-charset switches, as well as /validate-charset to detect if your file is not actually UTF-8. This page on MSDN has a link to the documentation on Unicode support for every version of Visual C++.

Current versions of the C++ standard say the compiler must have both an execution character set, which determines the numeric value of character constants like 'a', and an execution wide-character set that determines the value of wide-character constants like L'é'.

To language-lawyer for a bit, there are very few requirements in the standard for how these must be encoded, and yet Visual C and C++ manage to break them. The execution character set must contain about 100 characters that cannot have negative values, and the encodings of the digits '0' through '9' must be consecutive. Neither capital nor lowercase letters have to be consecutive, because they weren’t on some old mainframes. (That is, '0'+9 must be the same as '9', but there is still a compiler in real-world use today whose default behavior is that 'a'+9 is not 'j' but '«', and this is legal.) The wide-character execution set must include the basic execution set and have enough bits to hold all the characters of any supported locale. Every mainstream compiler supports at least one Unicode locale and understands valid Unicode characters specified with \Uxxxxxxxx, but a compiler that didn’t could claim to be complying with the standard.

The way Visual C and C++ violate the language standard is by making their wchar_t UTF-16, which can only represent some characters as surrogate pairs, when the standard says wchar_t must be a fixed-width encoding. This is because Microsoft defined wchar_t as 16 bits wide back in the 1990s, before the Unicode committee figured out that 16 bits were not going to be enough for the entire world, and Microsoft was not going to break the Windows API. It does support the standard char32_t type as well.
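
To illustrate the surrogate-pair point, here is a small sketch (mine, not part of the original answer; it needs C++11 for char32_t and the \U escape). A character outside the Basic Multilingual Plane takes two wchar_t code units on Windows but only one char32_t code unit:

#include <iostream>
#include <cwchar>

int main()
{
    // U+1F600 is outside the Basic Multilingual Plane.
    const wchar_t*  w = L"\U0001F600";  // UTF-16 on Windows: stored as a surrogate pair
    const char32_t* u = U"\U0001F600";  // UTF-32: always one code unit per character

    std::cout << "wchar_t  code units: " << std::wcslen(w) << '\n';          // 2 with MSVC (1 with gcc on Linux, where wchar_t is 32-bit)
    std::cout << "char32_t code units: " << (u[1] == U'\0' ? 1 : 2) << '\n'; // 1
    return 0;
}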

UTF-8 String Literals

The third issue this question raises is how to get the compiler to encode a string literal as UTF-8 in memory. You’ve been able to write something like this since C++11:

constexpr unsigned char hola_utf8[] = u8"¡Hola, mundo!";

This will encode the string as its null-terminated UTF-8 byte representation regardless of whether the source character set is UTF-8, UTF-16, Latin-1, CP1252, or even IBM EBCDIC 1047 (which is a silly theoretical example but still, for backward-compatibility, the default on IBM’s Z-series mainframe compiler). That is, it’s equivalent to initializing the array with { 0xC2, 0xA1, 'H', /* ... , */ '!', 0 }.

If it would be too inconvenient to type a character in, or if you want to distinguish between superficially-identical characters such as space and non-breaking space or precomposed and combining characters, you also have universal character escapes:

constexpr unsigned char hola_utf8[] = u8"\u00a1Hola, mundo!";

You can use these regardless of the source character set and regardless of whether you’re storing the literal as UTF-8, UTF-16 or UCS-4. They were originally added in C99, but Microsoft supported them in Visual Studio 2015.

Edit: As reported by Matthew, u8"..." strings are buggy in some versions of MSVC, including 19.14. It turns out, so are literal non-ASCII characters, even if you specify /utf-8 or /source-charset:utf-8 /execution-charset:utf-8. The sample code above works properly in 19.22.27905.

There is another way to do this that worked in Visual C or C++ 2008, however: octal and hexadecimal escape codes. You would have encoded UTF-8 literals in that version of the compiler with:

const unsigned char hola_utf8[] = "\xC2\xA1Hello, world!";
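
As a quick sanity check (my own sketch, not part of the original answer), the two spellings can be compared at run time. Compile it as C++11/14/17; in C++20 the type of a u8 literal changes to char8_t, so the first initialization would no longer compile as written:

#include <cstdio>
#include <cstring>

int main()
{
    // Both spellings should produce exactly the same null-terminated bytes.
    const char u8_way[]  = u8"\u00a1Hola, mundo!";
    const char hex_way[] = "\xC2\xA1Hola, mundo!";

    std::printf("identical: %s\n",
                std::strcmp(u8_way, hex_way) == 0 ? "yes" : "no");

    // Dump the bytes: expect C2 A1 48 6F 6C 61 ...
    for (const char* p = u8_way; *p != '\0'; ++p)
        std::printf("%02X ", static_cast<unsigned>(static_cast<unsigned char>(*p)));
    std::printf("\n");
    return 0;
}
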
Davislor
  • It seems you *can't* use UCEs regardless of source character set; VS butchers them into mojibake *even* for UTF-X literals. (OTOH, that's almost certainly a compiler bug...) – Matthew Mar 14 '19 at 20:20
  • @Matthew The bug you reported is fixed in MSVC 19.22.27905. Thanks! – Davislor Sep 08 '19 at 23:35
  • @Matthew I added a note about the version of the compiler I was able to reproduce the bug with and the version that worked, but I’d appreciate more information if you have it. – Davislor Sep 08 '19 at 23:43
3

I tried this code:

#include <iostream>
#include <fstream>
#include <sstream>

int main()
{
    std::wstringstream wss;
    wss << L"àéêù";
    std::wstring s = wss.str();
    const wchar_t* p = s.c_str();
    std::wcout << wss.str() << std::endl;

    std::wofstream file("C:\\a.txt");
    file << p << std::endl;

    return 0;
}

The debugger showed that wss, s and p all had the expected values (i.e. "àéêù"), as did the output file. However, what appeared in the console was óúÛ¨.

The problem is therefore in the Visual Studio console, not the C++. Using Bahbar's excellent answer, I added:

    SetConsoleOutputCP(1252);

as the first line, and the console output then appeared as it should.
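
For reference, here is my reconstruction of the full program with that call added (a sketch under the same assumptions as the answer: a Western European Windows console where code page 1252 is available):

#include <windows.h>   // SetConsoleOutputCP
#include <iostream>
#include <fstream>
#include <sstream>

int main()
{
    SetConsoleOutputCP(1252);   // tell the console to interpret output bytes as CP1252

    std::wstringstream wss;
    wss << L"àéêù";
    std::wstring s = wss.str();

    std::wcout << s << std::endl;     // now renders as àéêù in the console

    std::wofstream file("C:\\a.txt");
    file << s << std::endl;

    return 0;
}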

Charles Anderson
3
//Save As Windows 1252
#include<iostream>
#include<windows.h>

int main()
{
    SetConsoleOutputCP(1252);
    std::cout << "àéêù" << std::endl;
}

Visual Studio does not support UTF-8 for C++, but partially supports it for C:

//Save As UTF8 without signature
#include<stdio.h>
#include<windows.h>

int main()
{
    SetConsoleOutputCP(65001);
    printf("àéêù\n");
}
vladasimovic
3

Using _setmode() works, and is arguably better than changing the codepage or setting a locale, since it'll actually make your program output in Unicode and thus will be consistent no matter which codepage or locale is currently set.

Example:

#include <iostream>
#include <io.h>
#include <fcntl.h>

int wmain()
{
    _setmode( _fileno(stdout), _O_U16TEXT );

    std::wcout << L"àéêù" << std::endl;

    return 0;
}


Inside Visual Studio, make sure you set up your project for Unicode (Right-click Project -> Click General -> Character Set = Use Unicode Character Set).

MinGW users:

  1. Define both UNICODE and _UNICODE
  2. Add -finput-charset=iso-8859-1 to the compiler options to get around this error: "converting to execution character set: Invalid argument"
  3. Add -municode to the linker options to get around "undefined reference to `WinMain@16'".


Edit: The equivalent call to set unicode input is: _setmode( _fileno(stdin), _O_U16TEXT );

Edit 2: An important piece of information, especially considering the question uses std::cout. This is not supported. The MSDN docs state (emphasis mine):

Unicode mode is for wide print functions (for example, wprintf) and is not supported for narrow print functions. Use of a narrow print function on a Unicode mode stream triggers an assert.

So, don't use std::cout when the console output mode is _O_U16TEXT; similarly, don't use std::cin when the console input is _O_U16TEXT. You must use the wide version of these facilities (std::wcout, std::wcin).
And do note that mixing cout and wcout in the same output is not allowed (but I find it works if you call flush() and then _setmode() before switching between the narrow and wide operations).
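
A sketch of that switching pattern (my own illustration of the workaround described above, not something the documentation guarantees):

#include <iostream>
#include <io.h>
#include <fcntl.h>
#include <cstdio>   // _fileno

int main()
{
    std::cout << "narrow text first" << std::endl;
    std::cout.flush();                          // drain pending narrow output
    _setmode(_fileno(stdout), _O_U16TEXT);      // switch stdout to UTF-16 mode

    std::wcout << L"àéêù" << std::endl;
    std::wcout.flush();                         // drain pending wide output
    _setmode(_fileno(stdout), _O_TEXT);         // back to narrow text mode

    std::cout << "narrow text again" << std::endl;
    return 0;
}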

Marc.2377
  • @Nikos `SetConsoleCP()` is moot because, if the input is unicode, the codepage does not really matter. You can read about codepages vs unicode in this [Joel post](https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/). Check my edit to see how to set unicode input. – Marc.2377 Sep 16 '19 at 18:07
1

Make sure you do not forget to change the console's font to Lucida Console, as mentioned by Bahbar: it was crucial in my case (French Win 7 64-bit with VC 2012).

Then, as mentioned by others, use SetConsoleOutputCP(1252) for C++, but it may fail depending on the available code pages, so you might want to use GetConsoleOutputCP() to check that it worked, or at least check that SetConsoleOutputCP(1252) did not return zero (it returns zero on failure). Changing the global locale also works (for some reason there is no need to do cout.imbue(locale())), but it may break some libraries!

In C, SetConsoleOutputCP(65001); or the locale-based approach worked for me once I had saved the source code as UTF-8 without signature (scroll down: the sans-signature choice is way down in the list of code pages).

Input using SetConsoleCP(65001); failed for me, apparently due to a bad implementation of code page 65001 in Windows. The locale approach failed too, in both C and C++. A more involved solution, not relying on native chars but on wchar_t, seems to be required.
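
One possible shape of such a wchar_t-based input path (my own sketch, not from the answer) is to bypass the code-page machinery entirely and use the wide console API, which exchanges UTF-16 with the console directly:

#include <windows.h>

int main()
{
    wchar_t buffer[256];
    DWORD charsRead = 0;
    DWORD charsWritten = 0;

    // ReadConsoleW/WriteConsoleW deal in UTF-16, so no console code page is involved.
    HANDLE in  = GetStdHandle(STD_INPUT_HANDLE);
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);

    if (ReadConsoleW(in, buffer, 255, &charsRead, nullptr))
        WriteConsoleW(out, buffer, charsRead, &charsWritten, nullptr);

    return 0;
}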

Mikal
0

I had the same problem with Chinese input. My source code is UTF-8 and I added /utf-8 to the compiler options. It works fine with C++ wide strings and wide chars, but not with narrow strings/chars, which show garbled characters/code in the Visual Studio 2019 debugger and in my SQL database. I have to use the narrow characters because of converting to SQLAPI++'s SAString. Eventually, I found that checking the option under Control Panel -> Region -> Administrative -> Change system locale resolves the issue. I know it is not an ideal solution, but it does help me.


Gary