4

I need to decode PowerShell stdout called from Python into a Python string.

My ultimate goal is to get the names of the network adapters on Windows as a list of strings. My current function looks like this and works well on Windows 10 with the English language:

import subprocess

def get_interfaces():
    ps = subprocess.Popen(['powershell', 'Get-NetAdapter', '|', 'select Name', '|', 'fl'], stdout = subprocess.PIPE)
    stdout, stderr = ps.communicate(timeout = 10)
    interfaces = []
    for i in stdout.split(b'\r\n'):
        if not i.strip():
            continue
        if i.find(b':') < 0:
            continue
        name, value = [ j.strip() for j in i.split(b':', 1) ]
        if name == b'Name':
            interfaces.append(value.decode('ascii')) # This fails for other users
    return interfaces

Other users have different display languages, so value.decode('ascii') fails for some of them. E.g., one user reported that changing it to decode('iso-8859-2') works well for him (so it is not UTF-8). How can I know which encoding to use to decode the stdout bytes returned by the call to PowerShell?
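For reference, a slightly more robust variant of the parsing (a sketch: the helper name and the utf-8 default are my assumptions, and utf-8 only holds if PowerShell is actually configured to emit it, as discussed in the answers below) separates the parsing from the subprocess call, so it can be tested without PowerShell, and splits each line on the first colon only:

```python
import subprocess

def parse_fl_names(stdout_bytes, encoding='utf-8'):
    """Parse Format-List style output (lines like b'Name : Ethernet')."""
    names = []
    for line in stdout_bytes.split(b'\r\n'):
        if b':' not in line:
            continue
        # Split on the first colon only, so adapter names containing ':' survive.
        key, _, value = line.partition(b':')
        if key.strip() == b'Name':
            names.append(value.strip().decode(encoding))
    return names

def get_interfaces():
    # Pass the pipeline as a single string, as suggested in the comments.
    ps = subprocess.Popen(['powershell', 'Get-NetAdapter | select Name | fl'],
                          stdout=subprocess.PIPE)
    stdout, stderr = ps.communicate(timeout=10)
    return parse_fl_names(stdout)
```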

UPDATE

After some experiments I am even more confused. The code page in my console, as returned by chcp, is 437. I changed the network adapter name to a name containing non-ASCII and non-cp437 characters. In an interactive PowerShell session, running Get-NetAdapter | select Name | fl correctly displayed the name, even its non-cp437 character. When I called PowerShell from Python, non-ASCII characters were converted to the closest ASCII characters (for example, ā to a and ž to z) and .decode('ascii') worked nicely. Could this behaviour (and correspondingly the solution) be Windows version dependent? I am on Windows 10, but users could be on older versions down to Windows 7.

Peter Mortensen
Eriks Dobelis
  • If your actual issue is how to get `powershell` output as Unicode text then you should put it into the title (I don't know what "default Windows display language encoding" is supposed to be). Check whether powershell accepts an explicit parameter to specify its stdout encoding (`$OutputEncoding`). Unrelated: on Windows, use a single string to pass a command, i.e., use `'a | b | c'` instead of `['a', '|', 'b', '|', 'c']`. – jfs Nov 26 '15 at 14:47
  • That's a good idea for a workaround, but it does not seem trivial. See http://stackoverflow.com/questions/22349139/utf-8-output-from-powershell Also, I would anyway be interested in finding out the default language encoding for other possible uses. – Eriks Dobelis Nov 26 '15 at 15:49
  • Don't complicate your task. I don't know what "default language encoding" is: is it `'mbcs'` (Windows encoding)? Is it encoding from `chcp` output (Windows "ANSI" encoding)? Is it leaking of Unicode API abstractions (UCS-2 or UTF-16le w/out BOM)? The question: "how to get a powershell stdout for a given command that might contain arbitrary Unicode characters?" is different from «what is "default Windows display language encoding"?». – jfs Nov 26 '15 at 16:07
  • Agreed. Changed the title. Finding out windows interface language will be a different issue with other code example. – Eriks Dobelis Nov 26 '15 at 16:11
  • btw, I've wrongly used "ANSI" instead of "OEM" above. See [Keep your eye on the code page](http://blogs.msdn.com/b/oldnewthing/archive/2005/03/08/389527.aspx). Related: [How to call type safely on a random file in Python?](http://stackoverflow.com/q/31841994/4279) – jfs Nov 26 '15 at 16:14
  • note: printing to Windows console may use a different API (such as `WriteConsoleW()`). It is Unicode API and therefore [it works whatever `chcp` returns](http://stackoverflow.com/a/32176732/4279). The redirected (to a pipe) stdout does not use this API (`Popen(cmd, stdout=PIPE)` case). Python 3 uses `locale.getpreferredencoding(False)` encoding in this case (something like cp1252 -- ANSI code page (`'mbcs'` equiv.)) while some command-line applications may use OEM code page (e.g., cp437 from `chcp`) here. – jfs Nov 26 '15 at 16:25
  • Actually, `locale.getpreferredencoding(False)` returns `cp1252` for me. And still `.decode('ascii')` works fine on my machine with non-cp1252 characters in adapter names, as in the UPDATE part above. – Eriks Dobelis Nov 26 '15 at 16:43
  • (1) your code uses binary mode. stdout is bytes in your case. `universal_newlines=True` enables text mode (yes, it is not an intuitive spelling). (2) both cp437 and cp1252 are compatible with the ascii encoding for ascii characters (a working `.decode('ascii', 'strict')` says that all bytes in stdout are in the ascii range; it can't differentiate between cp437 and cp1252). – jfs Nov 26 '15 at 16:45
  • Great. I would say universal_newlines qualifies as the answer. – Eriks Dobelis Nov 26 '15 at 16:48
  • It does not qualify because it would be a wrong answer. The fact that Python 3 uses cp1252 does not mean that the actual executable (the child process) uses cp1252 for its stdout (you can try to decode bytes using whatever encoding but it may fail -- it is how mojibake is created. Also, [see "Bush hid the facts" bug](https://en.wikipedia.org/wiki/Bush_hid_the_facts)). I would try `$OutputEncoding = New-Object -typename System.Text.UTF8Encoding` (in powershell) and `.decode('utf-8')` (in Python) instead. – jfs Nov 26 '15 at 16:51
  • universal_newlines allows me to avoid the need to .decode(), as it returns string instead of bytes. Is there a risk the encoding used by powershell and python differs, so that string is with incorrect characters? – Eriks Dobelis Nov 26 '15 at 17:03
  • yes, that is why I've mentioned mojibake. What do you get if you run `print(check_output(['powershell', 'echo É']))`? (I'm not sure how to write `'echo É'` in PowerShell). If you see `b'\x90'` in the output then the encoding is cp437. If you see `b'\xc9'` then the encoding is cp1252. btw., you could [use `for line in io.TextIOWrapper(process.stdout, encoding='utf-8'):`](http://stackoverflow.com/a/33453867/4279) if you don't want to call `.decode('utf-8')`. – jfs Nov 26 '15 at 17:15
  • It seems that the default `$OutputEncoding` is ascii and therefore the above command probably produces `b'E'` (if something strips non-ascii parts) i.e., if you want to get non-ascii characters then you should set `$OutputEncoding` correspondingly (utf-8 is a good candidate). – jfs Nov 26 '15 at 17:32
  • @J.F.Sebastian, the encoding of piped output seems to use the console output codepage. I tested with various codepages, e.g. w/ 1252: `ctypes.windll.kernel32.SetConsoleOutputCP(1252);` `p = subprocess.Popen('powershell echo $([char]0xc9)', stdout=subprocess.PIPE);` `p.stdout.read()`. Weirdly, if I pass `creationflags=DETACHED_PROCESS`, such that powershell.exe doesn't attach to a console, the silly thing doesn't even have a sensible default of the ANSI codepage. It outputs *nothing at all*. – Eryk Sun Nov 26 '15 at 18:17

2 Answers

3

The output character encoding may depend on specific commands e.g.:

#!/usr/bin/env python3
import subprocess
import sys

encoding = 'utf-32'
cmd = r'''$env:PYTHONIOENCODING = "%s"; py -3 -c "print('\u270c')"''' % encoding
data = subprocess.check_output(["powershell", "-C", cmd])
print(sys.stdout.encoding)
print(data)
print(ascii(data.decode(encoding)))

Output

cp437
b"\xff\xfe\x00\x00\x0c'\x00\x00\r\x00\x00\x00\n\x00\x00\x00"
'\u270c\r\n'

The ✌ (U+270C) character is received successfully.

The character encoding of the child script is set using PYTHONIOENCODING envvar inside the PowerShell session. I've chosen utf-32 for the output encoding so that it would be different from Windows ANSI and OEM code pages for the demonstration.

Notice that the stdout encoding of the parent Python script is OEM code page (cp437 in this case) -- the script is run from the Windows console. If you redirect the output of the parent Python script to a file/pipe then ANSI code page (e.g., cp1252) is used by default in Python 3.
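The practical consequence is that the same byte maps to different characters under the OEM and ANSI code pages, so decoding with the wrong one silently produces mojibake. A small illustration (not Windows-specific; it only uses Python's codecs):

```python
data = b'\xc9'                 # 'É' encoded in the ANSI code page cp1252
print(data.decode('cp1252'))   # É  -- correct if the child process used cp1252
print(data.decode('cp437'))    # ╔  -- mojibake: the OEM code page maps 0xC9 elsewhere
print('É'.encode('cp437'))     # b'\x90' -- the same character is a different byte in cp437
```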

To decode powershell output that might contain characters undecodable in the current OEM code page, you could set [Console]::OutputEncoding temporarily (inspired by @eryksun's comments):

#!/usr/bin/env python3
import io
import sys
from subprocess import Popen, PIPE

char = ord('✌')
filename = 'U+{char:04x}.txt'.format(**vars())
with Popen(["powershell", "-C", '''
    $old = [Console]::OutputEncoding
    [Console]::OutputEncoding = [Text.Encoding]::UTF8
    echo $([char]0x{char:04x}) | fl
    echo $([char]0x{char:04x}) | tee {filename}
    [Console]::OutputEncoding = $old'''.format(**vars())],
           stdout=PIPE) as process:
    print(sys.stdout.encoding)
    for line in io.TextIOWrapper(process.stdout, encoding='utf-8-sig'):
        print(ascii(line))
print(ascii(open(filename, encoding='utf-16').read()))

Output

cp437
'\u270c\n'
'\u270c\n'
'\u270c\n'

Both fl and tee use [Console]::OutputEncoding for stdout (the default behavior is as if | Write-Output is appended to the pipelines). tee uses utf-16 to save the text to a file. The output shows that ✌ (U+270C) is decoded successfully.
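A note on the 'utf-8-sig' choice above: when [Console]::OutputEncoding is set to UTF-8, PowerShell may put a UTF-8 BOM at the start of the stream. 'utf-8-sig' strips the BOM if present and behaves like plain 'utf-8' otherwise, so it is the safer choice for decoding:

```python
with_bom = b'\xef\xbb\xbfName : Ethernet\r\n'  # stream starting with a UTF-8 BOM
without_bom = b'Name : Ethernet\r\n'

print(ascii(with_bom.decode('utf-8-sig')))     # BOM stripped
print(ascii(without_bom.decode('utf-8-sig')))  # unchanged
print(ascii(with_bom.decode('utf-8')))         # plain utf-8 keeps the BOM as '\ufeff'
```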

$OutputEncoding is used to decode bytes in the middle of a pipeline:

#!/usr/bin/env python3
import subprocess

cmd = r'''
  $OutputEncoding = [Console]::OutputEncoding = New-Object System.Text.UTF8Encoding
  py -3 -c "import os; os.write(1, '\U0001f60a'.encode('utf-8')+b'\n')" |
  py -3 -c "import os; print(os.read(0, 512))"
'''
subprocess.check_call(["powershell", "-C", cmd])

Output

b'\xf0\x9f\x98\x8a\r\n'

That is correct: b'\xf0\x9f\x98\x8a'.decode('utf-8') == u'\U0001f60a'. With the default $OutputEncoding (ascii) we would get b'????\r\n' instead.
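Why four question marks for a single character? In the middle of a pipeline PowerShell decodes the child's bytes with [Console]::OutputEncoding and re-encodes them with $OutputEncoding; with the defaults, the four UTF-8 bytes are misread as four unrelated characters, each of which is unencodable in ASCII. A Python sketch of the effect (cp437 stands in for the OEM code page; this mimics the .NET behavior, it is not the actual .NET code path):

```python
raw = '\U0001f60a'.encode('utf-8')            # b'\xf0\x9f\x98\x8a' -- four bytes
misread = raw.decode('cp437')                 # decoded as four unrelated OEM characters
mangled = misread.encode('ascii', 'replace')  # each non-ASCII character becomes '?'
print(mangled)                                # b'????'
```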

Note:

  • b'\n' is replaced with b'\r\n' despite using binary API such as os.read/os.write (msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY) has no effect here)
  • b'\r\n' is appended if there is no newline in the output:

    #!/usr/bin/env python3
    from subprocess import check_output
    
    cmd = '''py -3 -c "print('no newline in the input', end='')"'''
    cat = '''py -3 -c "import os; os.write(1, os.read(0, 512))"'''  # pass as is
    piped = check_output(['powershell', '-C', '{cmd} | {cat}'.format(**vars())])
    no_pipe = check_output(['powershell', '-C', '{cmd}'.format(**vars())])
    print('piped:   {piped}\nno pipe: {no_pipe}'.format(**vars()))
    

    Output:

    piped:   b'no newline in the input\r\n'
    no pipe: b'no newline in the input'
    

    The newline is appended to the piped output.

If we ignore lone surrogates then setting UTF8Encoding allows all Unicode characters, including non-BMP characters, to be passed via pipes. Text mode could be used in Python if $env:PYTHONIOENCODING = "utf-8:ignore" is configured.

In an interactive PowerShell session, running Get-NetAdapter | select Name | fl correctly displayed the name, even its non-cp437 character.

If stdout is not redirected then the Unicode API is used to print characters to the console -- any [BMP] Unicode character can be displayed if the console (TrueType) font supports it.

When I called PowerShell from Python, non-ASCII characters were converted to the closest ASCII characters (e.g., ā to a, ž to z) and .decode('ascii') worked nicely.

It might be due to System.Text.InternalDecoderBestFitFallback set for [Console]::OutputEncoding -- if a Unicode character can't be encoded in a given encoding then it is passed to the fallback (either a best fit char or '?' is used instead of the original character).
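Python has no direct equivalent of .NET's best-fit fallback tables, but a rough approximation of the observed ā → a, ž → z behavior (an illustration only, not the actual mechanism PowerShell uses) is NFKD decomposition followed by dropping the combining marks:

```python
import unicodedata

def best_fit_ascii(text):
    # Decompose accented letters (ā -> 'a' + combining macron), then drop
    # everything outside ASCII. This only approximates .NET's best-fit tables.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(best_fit_ascii('āž'))  # 'az'
```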

Could this behavior (and correspondingly solution) be Windows version dependent? I am on Windows 10, but users could be on older Windows down to Windows 7.

If we ignore bugs in cp65001 and the list of new encodings that are supported only in later versions, then the behavior should be the same.

jfs
  • I assume you have py.exe configured to run Python 3. Add the `-3` option for the benefit of others. Otherwise, a big +1. Nice answer. – Eryk Sun Nov 27 '15 at 16:29
  • As to PowerShell's object pipeline, it seems you have to completely [reinvent the wheel](https://gist.github.com/beatcracker/931613a94a1988d17b31) to get a binary pipeline that avoids the problem you experienced with text encoding and LF getting converted to CRLF. The cmd shell simply redirects its own standard handles when creating the processes for a pipeline. Once the processes are established it doesn't get in the way as a man in the middle. – Eryk Sun Nov 27 '15 at 16:30
  • I did not expect question to be so difficult when I asked it. Thanks for the time invested and detailed answer! – Eriks Dobelis Nov 28 '15 at 07:12
  • It seems you are one of the most qualified to answer a [similar question](https://stackoverflow.com/questions/49121900/windows-console-encoding) about Windows console encoding. – Maggyero Mar 06 '18 at 01:11
-1

It's a Python 2 bug already marked as wontfix: https://bugs.python.org/issue19264

You must use Python 3 if you want to make it work under Windows.

sorin
  • I am using Python 3 already. It is explicitly marked as a python-3.x question via tags. Also, you can see from the code (b'') that this is Python 3. – Eriks Dobelis Nov 26 '15 at 15:33
  • the bug is unrelated to the issue in the question. The bug is about how the command-line is passed to Windows and the question is about subprocess' stdout encoding. – jfs Nov 26 '15 at 15:56
  • In fact they are the same thing: if you use unicode (W) API on Windows you will get it to work, without having to decode/encode. – sorin Nov 26 '15 at 16:03
  • @sorin What is your suggestion? To call it from python in some other way? – Eriks Dobelis Nov 26 '15 at 16:05
  • @sorin: please, do provide a code example in Python 3 that returns subprocess' stdout as Unicode regardless of what `chcp` returns or what 'mbcs' corresponds to. – jfs Nov 26 '15 at 16:09