278

When I open cmd.exe in Windows, what encoding is it using?

How can I check which encoding it is currently using? Does it depend on my regional setting or are there any environment variables to check?

What happens when you type a file with a certain encoding? Sometimes I get garbled characters (incorrect encoding used) and sometimes it kind of works. However I don't trust anything as long as I don't know what's going on. Can anyone explain?

Peter Mortensen
  • 28,342
  • 21
  • 95
  • 123
Dan Gøran Lunde
  • 4,029
  • 3
  • 21
  • 22

6 Answers6

407

Yes, it’s frustrating—sometimes type and other programs print gibberish, and sometimes they do not.

First of all, Unicode characters will only display if the current console font contains the characters. So use a TrueType font like Lucida Console instead of the default Raster Font.

But if the console font doesn’t contain the character you’re trying to display, you’ll see question marks instead of gibberish. When you get gibberish, there’s more going on than just font settings.

When programs use standard C-library I/O functions like printf, the program’s output encoding must match the console’s output encoding, or you will get gibberish. chcp shows and sets the current codepage. All output using standard C-library I/O functions is treated as if it is in the codepage displayed by chcp.

Matching the program’s output encoding with the console’s output encoding can be accomplished in two different ways:

  • A program can get the console’s current codepage using chcp or GetConsoleOutputCP, and configure itself to output in that encoding, or

  • You or a program can set the console’s current codepage using chcp or SetConsoleOutputCP to match the default output encoding of the program.

However, programs that use Win32 APIs can write UTF-16LE strings directly to the console with WriteConsoleW. This is the only way to get correct output without setting codepages. And even when using that function, if a string is not in the UTF-16LE encoding to begin with, a Win32 program must pass the correct codepage to MultiByteToWideChar. Also, WriteConsoleW will not work if the program’s output is redirected; more fiddling is needed in that case.

type works some of the time because it checks the start of each file for a UTF-16LE Byte Order Mark (BOM), i.e. the bytes 0xFF 0xFE. If it finds such a mark, it displays the Unicode characters in the file using WriteConsoleW regardless of the current codepage. But when typeing any file without a UTF-16LE BOM, or for using non-ASCII characters with any command that doesn’t call WriteConsoleW—you will need to set the console codepage and program output encoding to match each other.


How can we find this out?

Here’s a test file containing Unicode characters:

ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    ąęźżńł
Russian   абвгдеж эюя
CJK       你好

Here’s a Java program to print out the test file in a bunch of different Unicode encodings. It could be in any programming language; it only prints ASCII characters or encoded bytes to stdout.

import java.io.*;

public class Foo {

    private static final String BOM = "\ufeff";
    private static final String TEST_STRING
        = "ASCII     abcde xyz\n"
        + "German    äöü ÄÖÜ ß\n"
        + "Polish    ąęźżńł\n"
        + "Russian   абвгдеж эюя\n"
        + "CJK       你好\n";

    public static void main(String[] args)
        throws Exception
    {
        String[] encodings = new String[] {
            "UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE" };

        for (String encoding: encodings) {
            System.out.println("== " + encoding);

            for (boolean writeBom: new Boolean[] {false, true}) {
                System.out.println(writeBom ? "= bom" : "= no bom");

                String output = (writeBom ? BOM : "") + TEST_STRING;
                byte[] bytes = output.getBytes(encoding);
                System.out.write(bytes);
                FileOutputStream out = new FileOutputStream("uc-test-"
                    + encoding + (writeBom ? "-bom.txt" : "-nobom.txt"));
                out.write(bytes);
                out.close();
            }
        }
    }
}

The output in the default codepage? Total garbage!

Z:\andrew\projects\sx\1259084>chcp
Active code page: 850

Z:\andrew\projects\sx\1259084>java Foo
== UTF-8
= no bom
ASCII     abcde xyz
German    ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish    ąęźżńł
Russian   ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK       õ¢áÕÑ¢
= bom
´╗┐ASCII     abcde xyz
German    ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish    ąęźżńł
Russian   ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK       õ¢áÕÑ¢
== UTF-16LE
= no bom
A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   ─ Í ▄   ▀
 P o l i s h         ♣☺↓☺z☺|☺D☺B☺
 R u s s i a n       0♦1♦2♦3♦4♦5♦6♦  M♦N♦O♦
 C J K               `O}Y
 = bom
 ■A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   ─ Í ▄   ▀
 P o l i s h         ♣☺↓☺z☺|☺D☺B☺
 R u s s i a n       0♦1♦2♦3♦4♦5♦6♦  M♦N♦O♦
 C J K               `O}Y
 == UTF-16BE
= no bom
 A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   ─ Í ▄   ▀
 P o l i s h        ☺♣☺↓☺z☺|☺D☺B
 R u s s i a n      ♦0♦1♦2♦3♦4♦5♦6  ♦M♦N♦O
 C J K              O`Y}
= bom
■  A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   ─ Í ▄   ▀
 P o l i s h        ☺♣☺↓☺z☺|☺D☺B
 R u s s i a n      ♦0♦1♦2♦3♦4♦5♦6  ♦M♦N♦O
 C J K              O`Y}
== UTF-32LE
= no bom
A   S   C   I   I                       a   b   c   d   e       x   y   z
   G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
   P   o   l   i   s   h                   ♣☺  ↓☺  z☺  |☺  D☺  B☺
   R   u   s   s   i   a   n               0♦  1♦  2♦  3♦  4♦  5♦  6♦      M♦  N
♦  O♦
   C   J   K                               `O  }Y
   = bom
 ■  A   S   C   I   I                       a   b   c   d   e       x   y   z

   G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
   P   o   l   i   s   h                   ♣☺  ↓☺  z☺  |☺  D☺  B☺
   R   u   s   s   i   a   n               0♦  1♦  2♦  3♦  4♦  5♦  6♦      M♦  N
♦  O♦
   C   J   K                               `O  }Y
   == UTF-32BE
= no bom
   A   S   C   I   I                       a   b   c   d   e       x   y   z
   G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
   P   o   l   i   s   h                  ☺♣  ☺↓  ☺z  ☺|  ☺D  ☺B
   R   u   s   s   i   a   n              ♦0  ♦1  ♦2  ♦3  ♦4  ♦5  ♦6      ♦M  ♦N
  ♦O
   C   J   K                              O`  Y}
= bom
  ■    A   S   C   I   I                       a   b   c   d   e       x   y   z

   G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
   P   o   l   i   s   h                  ☺♣  ☺↓  ☺z  ☺|  ☺D  ☺B
   R   u   s   s   i   a   n              ♦0  ♦1  ♦2  ♦3  ♦4  ♦5  ♦6      ♦M  ♦N
  ♦O
   C   J   K                              O`  Y}

However, what if we type the files that got saved? They contain the exact same bytes that were printed to the console.

Z:\andrew\projects\sx\1259084>type *.txt

uc-test-UTF-16BE-bom.txt


■  A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   ─ Í ▄   ▀
 P o l i s h        ☺♣☺↓☺z☺|☺D☺B
 R u s s i a n      ♦0♦1♦2♦3♦4♦5♦6  ♦M♦N♦O
 C J K              O`Y}

uc-test-UTF-16BE-nobom.txt


 A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   ─ Í ▄   ▀
 P o l i s h        ☺♣☺↓☺z☺|☺D☺B
 R u s s i a n      ♦0♦1♦2♦3♦4♦5♦6  ♦M♦N♦O
 C J K              O`Y}

uc-test-UTF-16LE-bom.txt


ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    ąęźżńł
Russian   абвгдеж эюя
CJK       你好

uc-test-UTF-16LE-nobom.txt


A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   ─ Í ▄   ▀
 P o l i s h         ♣☺↓☺z☺|☺D☺B☺
 R u s s i a n       0♦1♦2♦3♦4♦5♦6♦  M♦N♦O♦
 C J K               `O}Y

uc-test-UTF-32BE-bom.txt


  ■    A   S   C   I   I                       a   b   c   d   e       x   y   z

   G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
   P   o   l   i   s   h                  ☺♣  ☺↓  ☺z  ☺|  ☺D  ☺B
   R   u   s   s   i   a   n              ♦0  ♦1  ♦2  ♦3  ♦4  ♦5  ♦6      ♦M  ♦N
  ♦O
   C   J   K                              O`  Y}

uc-test-UTF-32BE-nobom.txt


   A   S   C   I   I                       a   b   c   d   e       x   y   z
   G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
   P   o   l   i   s   h                  ☺♣  ☺↓  ☺z  ☺|  ☺D  ☺B
   R   u   s   s   i   a   n              ♦0  ♦1  ♦2  ♦3  ♦4  ♦5  ♦6      ♦M  ♦N
  ♦O
   C   J   K                              O`  Y}

uc-test-UTF-32LE-bom.txt


 A S C I I           a b c d e   x y z
 G e r m a n         ä ö ü   Ä Ö Ü   ß
 P o l i s h         ą ę ź ż ń ł
 R u s s i a n       а б в г д е ж   э ю я
 C J K               你 好

uc-test-UTF-32LE-nobom.txt


A   S   C   I   I                       a   b   c   d   e       x   y   z
   G   e   r   m   a   n                   õ   ÷   ³       ─   Í   ▄       ▀
   P   o   l   i   s   h                   ♣☺  ↓☺  z☺  |☺  D☺  B☺
   R   u   s   s   i   a   n               0♦  1♦  2♦  3♦  4♦  5♦  6♦      M♦  N
♦  O♦
   C   J   K                               `O  }Y

uc-test-UTF-8-bom.txt


´╗┐ASCII     abcde xyz
German    ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish    ąęźżńł
Russian   ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK       õ¢áÕÑ¢

uc-test-UTF-8-nobom.txt


ASCII     abcde xyz
German    ├ñ├Â├╝ ├ä├û├£ ├ƒ
Polish    ąęźżńł
Russian   ð░ð▒ð▓ð│ð┤ðÁð ÐìÐÄÐÅ
CJK       õ¢áÕÑ¢

The only thing that works is UTF-16LE file, with a BOM, printed to the console via type.

If we use anything other than type to print the file, we get garbage:

Z:\andrew\projects\sx\1259084>copy uc-test-UTF-16LE-bom.txt CON
 ■A S C I I           a b c d e   x y z
 G e r m a n         õ ÷ ³   ─ Í ▄   ▀
 P o l i s h         ♣☺↓☺z☺|☺D☺B☺
 R u s s i a n       0♦1♦2♦3♦4♦5♦6♦  M♦N♦O♦
 C J K               `O}Y
         1 file(s) copied.

From the fact that copy CON does not display Unicode correctly, we can conclude that the type command has logic to detect a UTF-16LE BOM at the start of the file, and use special Windows APIs to print it.

We can see this by opening cmd.exe in a debugger when it goes to type out a file:

enter image description here

After type opens a file, it checks for a BOM of 0xFEFF—i.e., the bytes 0xFF 0xFE in little-endian—and if there is such a BOM, type sets an internal fOutputUnicode flag. This flag is checked later to decide whether to call WriteConsoleW.

But that’s the only way to get type to output Unicode, and only for files that have BOMs and are in UTF-16LE. For all other files, and for programs that don’t have special code to handle console output, your files will be interpreted according to the current codepage, and will likely show up as gibberish.

You can emulate how type outputs Unicode to the console in your own programs like so:

#include <stdio.h>
#define UNICODE
#include <windows.h>

static LPCSTR lpcsTest =
    "ASCII     abcde xyz\n"
    "German    äöü ÄÖÜ ß\n"
    "Polish    ąęźżńł\n"
    "Russian   абвгдеж эюя\n"
    "CJK       你好\n";

int main() {
    int n;
    wchar_t buf[1024];

    HANDLE hConsole = GetStdHandle(STD_OUTPUT_HANDLE);

    n = MultiByteToWideChar(CP_UTF8, 0,
            lpcsTest, strlen(lpcsTest),
            buf, sizeof(buf));

    WriteConsole(hConsole, buf, n, &n, NULL);

    return 0;
}

This program works for printing Unicode on the Windows console using the default codepage.


For the sample Java program, we can get a little bit of correct output by setting the codepage manually, though the output gets messed up in weird ways:

Z:\andrew\projects\sx\1259084>chcp 65001
Active code page: 65001

Z:\andrew\projects\sx\1259084>java Foo
== UTF-8
= no bom
ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    ąęźżńł
Russian   абвгдеж эюя
CJK       你好
ж эюя
CJK       你好
 你好
好
�
= bom
ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    ąęźżńł
Russian   абвгдеж эюя
CJK       你好
еж эюя
CJK       你好
  你好
好
�
== UTF-16LE
= no bom
A S C I I           a b c d e   x y z
…

However, a C program that sets a Unicode UTF-8 codepage:

#include <stdio.h>
#include <windows.h>

int main() {
    int c, n;
    UINT oldCodePage;
    char buf[1024];

    oldCodePage = GetConsoleOutputCP();
    if (!SetConsoleOutputCP(65001)) {
        printf("error\n");
    }

    freopen("uc-test-UTF-8-nobom.txt", "rb", stdin);
    n = fread(buf, sizeof(buf[0]), sizeof(buf), stdin);
    fwrite(buf, sizeof(buf[0]), n, stdout);

    SetConsoleOutputCP(oldCodePage);

    return 0;
}

does have correct output:

Z:\andrew\projects\sx\1259084>.\test
ASCII     abcde xyz
German    äöü ÄÖÜ ß
Polish    ąęźżńł
Russian   абвгдеж эюя
CJK       你好

The moral of the story?

  • type can print UTF-16LE files with a BOM regardless of your current codepage
  • Win32 programs can be programmed to output Unicode to the console, using WriteConsoleW.
  • Other programs which set the codepage and adjust their output encoding accordingly can print Unicode on the console regardless of what the codepage was when the program started
  • For everything else you will have to mess around with chcp, and will probably still get weird output.
andrewdotn
  • 27,806
  • 6
  • 80
  • 110
  • 77
    Whoa, this must be the most detailed answer I've ever seen on SO. Extra credit for the dissasembly prints and multilanguage skillz! Just beautiful, sir! – airstrike Jun 28 '13 at 16:58
  • 2
    One may also want to study the Microsoft-specific extension _setmode(_fileno(stdout), _O_U16TEXT) which was introduced in VS2008. See http://stackoverflow.com/a/9051543, and http://stackoverflow.com/a/12015918, and http://msdn.microsoft.com/en-us/library/tw4k6df8(v=vs.90).aspx Besides obvious portability differences between _setmode() and SetConsoleOutputCP(), there may also be other subtleties and side-effects hidden in both approaches that's not fully understood at first glance. If andrewdotn could update his answer with any observations about _setmode(fd,_O_U16TEXT), that would be great. – JasDev Sep 07 '14 at 12:58
  • 14
    While this is an excellent answer, it's misleading to say the console supports UTF-16. It's limited to UCS-2, i.e. limited to characters in the basic multilingual plane (BMP). When the Win32 console server (conhost.exe, nowadays) was designed circa 1990, Unicode was a 16-bit standard, so the console screen buffer uses one 16-bit WCHAR per character cell. A UTF-16 surrogate pair prints as two box characters. – Eryk Sun Feb 23 '15 at 06:22
  • 1
    @eryksun - presumably "one 16-bit WCHAR per character cell" also means that the console does not support any form of combining character? Are there other limitations on Unicode in the Windows console? – user200783 Jun 17 '16 at 12:14
  • 4
    @user200783, decomposed form isn't supported; usually one can transform to an NFC equivalent. Also, the console in Western locales doesn't allow mixing full-width and half-width glyphs. Also, when using codepage 65001 (UTF-8), prior to Windows 8 `WriteFile` reports the number of characters written instead of the number of bytes, so buffered writers retry the 'remaining' bytes several times in proportion to the number of non-ASCII characters. Also in 65001, reading non-ASCII characters fails in conhost.exe because it assumes 1 ANSI byte per UTF-16 code when calling `WideCharToMultiByte`. – Eryk Sun Jun 17 '16 at 13:15
  • @eryksun - Very helpful, thank you. Your two comments do a good job of addressing [one of my recent questions](http://stackoverflow.com/q/37874063/200783) - if you would like to post them as an answer there, I will mark it as accepted. – user200783 Jun 18 '16 at 01:45
  • 2
    The simple demo programs in this answer assume that `GetStdHandle(STD_OUTPUT_HANDLE)` and C `stdout` are console handles. In practice, to test for a console, check that [`GetConsoleMode`](https://msdn.microsoft.com/en-us/library/ms683167) succeeds. Also, don't use the C runtime `_isatty` function to check if a low I/O file descriptor is a console; that just checks for a character-mode device, which includes `NUL` among others. Instead, call [`_get_osfhandle`](https://msdn.microsoft.com/en-us/library/ks2530z6) and check the handle directly. – Eryk Sun Jun 18 '16 at 08:22
  • just as a side-note: if you need this for **PHP** CMD-output it might be "codepage 850". so to convert utf8 to CP850 for example, you can do this: `mb_convert_encoding($string, 'CP850', 'UTF-8');` – low_rents Jul 15 '16 at 10:32
  • This helped me understand the output of Oracle Instant Client on windows command prompt. Thank you! – Alfabravo Aug 01 '16 at 16:47
  • 1
    I've learned and want to explore a lot of tests, thanks! I have troubles with terminal input sometimes too, so i'll go that way. – Cerberus Dec 14 '18 at 17:15
30

Type

chcp

to see your current code page (as Dewfy already said).

Use

nlsinfo

to see all installed code pages and find out what your code page number means.

You need to have Windows Server 2003 Resource kit installed (works on Windows XP) to use nlsinfo.

Peter Mortensen
  • 28,342
  • 21
  • 95
  • 123
Cagdas Altinkaya
  • 1,570
  • 2
  • 19
  • 32
21

To answer your second query re. how encoding works, Joel Spolsky wrote a great introductory article on this. Strongly recommended.

Brian Agnew
  • 254,044
  • 36
  • 316
  • 423
  • 13
    I've read it and I know it. However, on Windows I always feel lost because the OS and most applications seem totally ignorant of encoding. – Dan Gøran Lunde Aug 11 '09 at 13:59
5

Command CHCP shows the current codepage. It has three digits: 8xx and is different from Windows 12xx. So typing a English-only text you wouldn't see any difference, but an extended codepage (like Cyrillic) will be printed wrongly.

Peter Mortensen
  • 28,342
  • 21
  • 95
  • 123
Dewfy
  • 21,895
  • 12
  • 66
  • 114
  • 5
    CHCP neither shows only 3 digits nor it is in the 8## format. 437 is for instance a US encoding, and it is the defacto standard on English systems. -- 65001 is a Unicode encoding (If I recall it right it is UTF-8 and 65000 is UTF-7) and can be chosen. Also CMD allows to switch to 1250 code page for instance, but I do not know since when these code pages are selectable. (It is under Win7.) – Adam L. S. Dec 26 '14 at 14:08
4

I've been frustrated for long by Windows code page issues, and the C programs portability and localisation issues they cause. The previous posts have detailed the issues at length, so I'm not going to add anything in this respect.

To make a long story short, eventually I ended up writing my own UTF-8 compatibility library layer over the Visual C++ standard C library. Basically this library ensures that a standard C program works right, in any code page, using UTF-8 internally.

This library, called MsvcLibX, is available as open source at https://github.com/JFLarvoire/SysToolsLib. Main features:

  • C sources encoded in UTF-8, using normal char[] C strings, and standard C library APIs.
  • In any code page, everything is processed internally as UTF-8 in your code, including the main() routine argv[], with standard input and output automatically converted to the right code page.
  • All stdio.h file functions support UTF-8 pathnames > 260 characters, up to 64 KBytes actually.
  • The same sources can compile and link successfully in Windows using Visual C++ and MsvcLibX and Visual C++ C library, and in Linux using gcc and Linux standard C library, with no need for #ifdef ... #endif blocks.
  • Adds include files common in Linux, but missing in Visual C++. Ex: unistd.h
  • Adds missing functions, like those for directory I/O, symbolic link management, etc, all with UTF-8 support of course :-).

More details in the MsvcLibX README on GitHub, including how to build the library and use it in your own programs.

The release section in the above GitHub repository provides several programs using this MsvcLibX library, that will show its capabilities. Ex: Try my which.exe tool with directories with non-ASCII names in the PATH, searching for programs with non-ASCII names, and changing code pages.

Another useful tool there is the conv.exe program. This program can easily convert a data stream from any code page to any other. Its default is input in the Windows code page, and output in the current console code page. This allows to correctly view data generated by Windows GUI apps (ex: Notepad) in a command console, with a simple command like: type WINFILE.txt | conv

This MsvcLibX library is by no means complete, and contributions for improving it are welcome!

2

In Java I used encoding "IBM850" to write the file. That solved the problem.

Neumi
  • 31
  • 2