How to handle UTF-8 encoded source when compiling on Windows?

Question

I'm currently writing a small C program, using MinGW's gcc to compile it on Windows. I'm also hosting it on GitHub (and using GitHub Desktop for Windows). GitHub, however, appears to enforce UTF-8 encoding in the files, and the Windows Terminal has trouble dealing with UTF-8.

After some searching I found a few solutions, but they all require manual steps on the end user's side, which I want to avoid (I'm not planning on distributing the program or anything, but I wonder what I would do if I were).

What currently works is changing the encoding to ANSI and manually fixing everything before compilation, but I would rather avoid having to do that every damn time I want to work on Windows.

So the question is: How to handle UTF-8 encoded source when compiling on Windows?

Here's some sample output:

[ Screenshot ]

Left: Source Code encoded in UTF-8 (displayed wrong).
Right: Source code encoded in ANSI (displayed right).

Compilation process is exactly the same, only difference is the actual source-code encoding.

I hope I didn't leave any relevant information out, but if I did please ask! — NomeQueEuLembro, Mar 13 '17 at 16:49
GCC handles it fine, but the problem appears to be caused by Windows Terminal. Apparently it only works with the Lucida Console font, but I don't want to have to change the terminal font just so my program runs. — NomeQueEuLembro, Mar 13 '17 at 16:55
@HansPassant I'm not opening any files! My source-code is encoded in UTF-8 and the compiled file isn't encoded correctly on Windows. When transcoding my source to ANSI everything works fine! -- However, pretty interesting to know fopen() handles encoding. Thanks! — NomeQueEuLembro, Mar 13 '17 at 17:02
So gcc compiles it ok in UTF-8 but the exe doesn't run? What actually goes wrong? — JeremyP, Mar 13 '17 at 17:04
Does Windows have an equivalent of `setlocale(LC_CTYPE, "")` that you should call early on to change the stdio coding system from the C locale, as POSIX does? — Toby Speight, Mar 13 '17 at 17:09
@JeremyP [Screenshot](http://i.imgur.com/iIJcSG4.png). Left: Source Code encoded in UTF-8 (displayed wrong). Right: Source code encoded in ANSI (displayed right). Compilation process is exactly the same, only difference is the actual source-code encoding. — NomeQueEuLembro, Mar 13 '17 at 17:20
@TobySpeight Yes, locale.h is a standard library and calling setlocale() is needed for the program to work properly in the ANSI version. In the UTF-8 version setlocale() changes the displayed chars, but both are wrong. — NomeQueEuLembro, Mar 13 '17 at 17:27
You need to find an API function that will convert your UTF-8 strings into strings using the encoding that the user has set for his/her PC. I don't know what that is in C on Windows. — JeremyP, Mar 13 '17 at 17:37
@Olaf [This answer](http://stackoverflow.com/a/701920/7362590) explains better than I could. — NomeQueEuLembro, Mar 13 '17 at 17:45
Can you write a [mcve] (which would be little more than `puts("foo");` and the compile command) so we can see what you tried so far? — Toby Speight, Mar 13 '17 at 17:50
It does not. It is actually wrong. There is no "ANSI" encoding. There is just a whole zoo of encodings which are based on ASCII, using the additional 128 codes for various purposes. They are all very different. And none covers the full Unicode character set. — too honest for this site, Mar 13 '17 at 17:50
Hey guys, I managed to figure it out (and posted it as a reply)! Thanks very much for your help! — NomeQueEuLembro, Mar 14 '17 at 14:15

NomeQueEuLembro · Answer 1 · 2017-03-14T14:15:07.487

The issue is caused by the fact that the Windows Terminal has trouble displaying UTF-8 encoded characters by default.

To solve the issue you need to tell the terminal to use the UTF-8 code page. You do not need to call setlocale() after changing the code page; doing so will probably mess things up.

To tell Windows which code page it should use to display output, you can call the SetConsoleOutputCP function, passing the UTF-8 code page identifier (65001) as its parameter (for more information, check "Code Page Identifiers" on MSDN).

Here is a test program:

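#include <stdio.h>
#include <locale.h>
#include <windows.h>

int main(void)
{
    UINT CODEPAGE_UTF8 = 65001;                     /* UTF-8 code page identifier */
    UINT CODEPAGE_ORIGINAL = GetConsoleOutputCP();  /* saved so it can be restored */

    /* Original code page, default "C" locale. */
    printf("DEFAULT CODEPAGE, DEFAULT LOCALE: ¶\n");

    /* Original code page, locale taken from the system. */
    setlocale(LC_ALL, "");
    printf("DEFAULT CODEPAGE, SYSTEM LOCALE: ¶\n");

    /* Switch console output to UTF-8. */
    SetConsoleOutputCP(CODEPAGE_UTF8);

    /* UTF-8 code page, back to the "C" locale. */
    setlocale(LC_ALL, "C");
    printf("UTF-8 CODEPAGE, DEFAULT LOCALE: ¶\n");

    /* UTF-8 code page, system locale. */
    setlocale(LC_ALL, "");
    printf("UTF-8 CODEPAGE, SYSTEM LOCALE: ¶\n");

    /* Put back the code page the console started with. */
    SetConsoleOutputCP(CODEPAGE_ORIGINAL);
    return 0;
}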

And here's the program output, compiled with source code encoded in ANSI, UTF-8 without BOM (Byte Order Mark) and UTF-8 with BOM, respectively:

[ TEST OUTPUT ]

Caveat: Some sources around the internet say this only works with certain fonts, notably Lucida Console. Also, this only works on Windows 2000 Professional and above. I don't think you will need to touch anything older than that nowadays, though.
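
For a real program, the same idea could be wrapped in a small helper, so the code page is switched once at startup and restored automatically on exit. The following is only a sketch: the names `console_use_utf8` and `console_restore_cp` are made up for illustration, and the `#ifdef _WIN32` guard is an assumption so that the same source also builds on systems without `windows.h`. `CP_UTF8` is the constant that `windows.h` defines for code page 65001.

```c
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Code page that was active before console_use_utf8() changed it. */
static UINT saved_output_cp;

/* Hypothetical helper: restore the original code page at exit. */
static void console_restore_cp(void)
{
    if (saved_output_cp != 0)
        SetConsoleOutputCP(saved_output_cp);
}
#endif

/*
 * Hypothetical helper: switch console output to UTF-8.
 * Returns 1 on success (or when nothing needs doing), 0 on failure.
 */
int console_use_utf8(void)
{
#ifdef _WIN32
    static int already_done = 0;

    if (already_done)
        return 1;

    /* GetConsoleOutputCP() returns 0 if it fails. */
    saved_output_cp = GetConsoleOutputCP();

    if (!SetConsoleOutputCP(CP_UTF8))
        return 0;

    /* Registration is best effort; its return value is ignored here. */
    atexit(console_restore_cp);
    already_done = 1;
    return 1;
#else
    /* Assumption: non-Windows terminals already expect UTF-8. */
    return 1;
#endif
}
```

Calling `console_use_utf8()` as the first statement of `main`, before any output is printed, should be enough. As noted above, `setlocale()` is deliberately not called after the switch.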

Thank you so much! `SetConsoleOutputCP(65001)` saved my day! — Andrey Sokolov, Oct 07 '20 at 10:45
