js2010's answer provides an effective solution; let me complement it with background information (a summary of the case at hand is at the bottom):
Fundamentally, PowerShell never preserves the character encoding of a [text] input file on output:
On reading, file content is decoded into .NET strings (which are internally sequences of UTF-16 code units):
Files with a BOM for the following encodings are always correctly recognized (the identifiers recognized by the `-Encoding` parameter of PowerShell's cmdlets are in parentheses):
- UTF-8 (`UTF8`)
- UTF-16LE (`Unicode`) / UTF-16BE (`BigEndianUnicode`)
- UTF-32LE (`UTF32`) / UTF-32BE (`BigEndianUTF32`)
- Note the absence of UTF-7, which, however, is rarely used as an encoding in practice.
Without a BOM, a default encoding is assumed:
- PowerShell [Core] v6+ commendably assumes UTF-8.
The legacy Windows PowerShell (PowerShell up to v5.1) assumes ANSI encoding, i.e., the code page determined by the legacy system locale; e.g., Windows-1252 on US-English systems.
The `-Encoding` parameter of file-reading cmdlets allows you to specify the source encoding explicitly, but note that the presence of a (supported) BOM overrides this - see below for what encodings are supported.
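For instance, here is a minimal sketch in Windows PowerShell (the file name is hypothetical; assume `file.txt` is BOM-less UTF-8 containing non-ASCII characters):

```powershell
# Without -Encoding, Windows PowerShell assumes ANSI and misinterprets
# the UTF-8 bytes (e.g., "café" renders as "cafÃ©").
Get-Content file.txt

# Specifying the source encoding explicitly yields the correct text.
Get-Content -Encoding UTF8 file.txt

# If the file had a (supported) BOM, it would be auto-detected and
# take precedence over the -Encoding argument.
```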
On writing, .NET strings are encoded based on a default encoding, unless an encoding is explicitly specified with `-Encoding` (the .NET strings created on reading carry no information about the encoding of the original input file, so it cannot be preserved):
PowerShell [Core] v6+ commendably uses BOM-less UTF-8.
The legacy Windows PowerShell (PowerShell up to v5.1) regrettably uses various default encodings, depending on the specific cmdlet / operator used.
Notably, `Set-Content` defaults to ANSI (as for reading), and `Out-File` / `>` defaults to UTF-16LE.
As noted in js2010's answer, using `-Encoding UTF8` in Windows PowerShell invariably creates files with a BOM, which can be problematic for files read by tools on Unix-like platforms / tools with a Unix heritage, which are often not equipped to deal with such a BOM.
- See the answers to this question for how to create BOM-less UTF-8 files in Windows PowerShell.
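To illustrate the varying Windows PowerShell defaults, and the .NET workaround for BOM-less UTF-8 (the sample string and file names are hypothetical):

```powershell
# Windows PowerShell (up to v5.1) only:
'café' | Set-Content t1.txt                  # ANSI (e.g., Windows-1252)
'café' | Out-File t2.txt                     # UTF-16LE with BOM
'café' > t3.txt                              # same as Out-File
'café' | Set-Content -Encoding UTF8 t4.txt   # UTF-8 *with* BOM

# BOM-less UTF-8 requires calling the .NET API directly;
# note that .NET needs a full path, hence the use of $PWD.
[IO.File]::WriteAllLines("$PWD\t5.txt", @('café'), [Text.UTF8Encoding]::new($false))
```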
As with reading, the `-Encoding` parameter of file-writing cmdlets allows you to specify the output encoding explicitly:
Note that in PowerShell [Core] v6+, in addition to being the default, `-Encoding UTF8` also refers to the BOM-less variant (unlike in Windows PowerShell), so there you must use `-Encoding UTF8BOM` in order to create a file with a BOM.
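A quick demonstration in PowerShell [Core] v6+ (hypothetical file names):

```powershell
# PowerShell [Core] v6+ only:
'café' | Set-Content t1.txt                     # BOM-less UTF-8 (the default)
'café' | Set-Content -Encoding UTF8 t2.txt      # also BOM-less UTF-8
'café' | Set-Content -Encoding UTF8BOM t3.txt   # UTF-8 *with* BOM
```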
Curiously, as of PowerShell [Core] v7.0, there is no `-Encoding` value for the system's active ANSI code page, i.e., for Windows PowerShell's default (in Windows PowerShell, `-Encoding Default` explicitly requests ANSI encoding, but in PowerShell [Core] it refers to BOM-less UTF-8). This problematic omission is discussed in this GitHub issue. By contrast, targeting the active OEM code page with `-Encoding OEM` still works.
In order to create UTF-32BE files, Windows PowerShell requires the identifier `BigEndianUtf32`; due to a bug in PowerShell [Core] as of v7.0, this identifier isn't supported, but you can use `UTF-32BE` instead.
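That is (hypothetical file names):

```powershell
# Windows PowerShell:
'café' | Set-Content -Encoding BigEndianUtf32 t.txt

# PowerShell [Core] as of v7.0 (BigEndianUtf32 isn't accepted there):
'café' | Set-Content -Encoding UTF-32BE t.txt
```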
Windows PowerShell is limited to the encodings listed in the `Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding` enumeration, but PowerShell [Core] allows you to pass any of the supported .NET encodings to the `-Encoding` parameter, either by code-page number (e.g., `1252`) or by encoding name (e.g., `windows-1252`); `[Text.Encoding]::GetEncodings().CodePage` and `[Text.Encoding]::GetEncodings().Name` enumerate them in principle, but note that, due to lack of .NET Core API support as of v7.0, this enumeration lists only a small subset of the actually supported encodings; running these commands in Windows PowerShell will show them all.
You can create UTF-7 files (`UTF7`), but they won't have a BOM; even input files that do have one aren't automatically recognized on reading, so specifying `-Encoding UTF7` is always necessary for reading UTF-7 files.
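For example (hypothetical file name):

```powershell
'café' | Set-Content -Encoding UTF7 t.txt   # no BOM is written
Get-Content -Encoding UTF7 t.txt            # -Encoding UTF7 is always required
```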
In short:
In PowerShell, you have to know an input file's encoding in order to match that encoding on writing, and you must specify that encoding explicitly via the `-Encoding` parameter (if it differs from the default).
`Get-Content` (without `-Encoding`) provides no information as to what encoding it detected via a BOM or which one it assumed in the absence of a BOM.
If needed, you can perform your own analysis of the opening bytes of a text file to look for a BOM, but note that in the absence of one you'll have to rely on heuristics to infer the encoding - that is, you can make a reasonable guess, but you cannot be certain.
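Here is a minimal sketch of such a BOM check (Windows PowerShell syntax; in PowerShell [Core], substitute `-AsByteStream` for `-Encoding Byte`; the file name is hypothetical):

```powershell
# Read the first 4 bytes and format them as space-separated hex.
$bytes = Get-Content -Encoding Byte -TotalCount 4 file.txt
switch -Regex (($bytes | ForEach-Object ToString X2) -join ' ') {
  '^EF BB BF'    { 'UTF-8 BOM';    break }
  '^FF FE 00 00' { 'UTF-32LE BOM'; break }  # must be tested before UTF-16LE
  '^FF FE'       { 'UTF-16LE BOM'; break }
  '^FE FF'       { 'UTF-16BE BOM'; break }
  '^00 00 FE FF' { 'UTF-32BE BOM'; break }
  default        { 'No BOM - the encoding must be guessed heuristically.' }
}
```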
Also note that PowerShell, as of v7, fundamentally lacks support for passing raw byte streams through the pipeline - see this answer.
Your particular case:
Your problem was that your input file was UTF-8-encoded, but didn't have a BOM (which is actually preferable for the widest compatibility).
Since you're using Windows PowerShell, which misinterprets such files as ANSI-encoded, you need to tell it to read the file as UTF-8 with `-Encoding Utf8`.
As stated, on writing, `-Encoding Utf8` inevitably creates a file with a BOM in Windows PowerShell; if that is a concern, use the .NET Framework directly to produce BOM-less files, as shown in the answers to this question.
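Putting it together for your case, a sketch of the approach just referenced (the file names are hypothetical):

```powershell
# Windows PowerShell: read the BOM-less UTF-8 input correctly ...
$lines = Get-Content -Encoding Utf8 in.txt
# ... and write it back out as UTF-8 *without* a BOM, via the .NET API
# (whereas Set-Content -Encoding Utf8 would add a BOM).
[IO.File]::WriteAllLines("$PWD\out.txt", [string[]] $lines, [Text.UTF8Encoding]::new($false))
```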
Note that you would have had no problem with your original command in PowerShell [Core] v6+ - it defaults to BOM-less UTF-8 both on reading and writing, across all cmdlets.
This sensible, standardized default alone is a good reason for considering the move to PowerShell v7.0, which aims to be a superior replacement for the legacy Windows PowerShell.