
I have a source CSV file which is quite big, and in order to work with it more efficiently I decided to split it into smaller chunks. To do that, I execute the following script:

Get-Content C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 | %{$i++; $_ | Out-File C:\Users\me\Desktop\Processed\splitfile_$i.csv}

As you can see, these are CSV files that contain alphanumeric data, and I have an issue with strings similar to this one:

Hämeenkatu 33

In the target file it looks like this:

Hämeenkatu 33

I've tried to determine the encoding of the source file and it is UTF-8 (as described here). I am really wondering why it gets so messed up in the target. I've also tried the following to explicitly specify that I want the encoding to be UTF-8, but without success:

Get-Content C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 | %{$i++; $_ | Out-File -Encoding "UTF8" C:\Users\me\Desktop\Processed\splitfile_$i.csv}

I am using a Windows machine running Windows 10.

user2128702
  • This looks like you are using Latin-1 or CP1252 to examine your files. If the input is CP1252 and that's what you are also using to examine the output file, the output is exactly correct (though you will probably need to improve your understanding of these concepts). Maybe see also the [Stack Overflow `character-encoding` tag info page](http://stackoverflow.com/tags/character-encoding/info) – tripleee Jan 20 '20 at 16:24
  • 1
    Conversely, if your input is already valid UTF-8 but you feed it to a tool which expects CP1252 input and will convert that to UTF-8, this is what it will do. – tripleee Jan 20 '20 at 16:29

2 Answers


Does the input file have a BOM? Try Get-Content -Encoding utf8. Out-File defaults to UTF-16LE, which is what Windows and PowerShell call "Unicode".

Get-Content -encoding utf8 C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 | 
  %{$i++; $_ | 
  Out-File -encoding utf8 C:\Users\me\Desktop\Processed\splitfile_$i.csv}

The output file will have a BOM unless you use PowerShell 6 or 7.
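
To verify which BOM (if any) an output file ended up with, you can inspect its first bytes; a quick sketch, assuming one of the splitfile_*.csv outputs from the question exists (Windows PowerShell 5.1 syntax):

$bytes = Get-Content C:\Users\me\Desktop\Processed\splitfile_1.csv -Encoding Byte -TotalCount 3
'{0:X2} {1:X2} {2:X2}' -f $bytes[0], $bytes[1], $bytes[2]
# FF FE ..  -> UTF-16LE BOM (Out-File's default in Windows PowerShell)
# EF BB BF  -> UTF-8 BOM (what -Encoding utf8 writes there)
# In PowerShell 6+/7, use -AsByteStream instead of -Encoding Byte.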

js2010

js2010's answer provides an effective solution; let me complement it with background information (a summary of the case at hand is at the bottom):

Fundamentally, PowerShell never preserves the character encoding of a [text] input file on output:

  • On reading, file content is decoded into .NET strings (which are internally UTF-16 code units):

    • Files with a BOM for the following encodings are always correctly recognized (identifiers recognized by the -Encoding parameter of PowerShell's cmdlets in parentheses):

      • UTF-8 (UTF8) - info
      • UTF-16LE (Unicode) / UTF-16BE (BigEndianUnicode) - info
      • UTF-32LE (UTF32) / UTF-32BE (BigEndianUTF32) - info
      • Note the absence of UTF-7, which, however, is rarely used as an encoding in practice.
    • Without a BOM, a default encoding is assumed:

      • PowerShell [Core] v6+ commendably assumes UTF-8.
      • The legacy Windows PowerShell (PowerShell up to v5.1) assumes ANSI encoding, i.e. the code page determined by the legacy system locale; e.g., Windows-1252 on US-English systems.
    • The -Encoding parameter of file-reading cmdlets allows you to specify the source encoding explicitly, but note that the presence of a (supported) BOM overrides this - see below for what encodings are supported.

  • On writing, .NET strings are encoded based on a default encoding, unless an encoding is explicitly specified with -Encoding (the .NET strings created on reading carry no information about the encoding of the original input file, so it cannot be preserved):

    • PowerShell [Core] v6+ commendably uses BOM-less UTF-8.

    • The legacy Windows PowerShell (PowerShell up to v5.1) regrettably uses various default encodings, depending on the specific cmdlet / operator used.

      • Notably, Set-Content defaults to ANSI (as for reading), and Out-File / > defaults to UTF-16LE.

      • As noted in js2010's answer, using -Encoding UTF8 in Windows PowerShell invariably creates files with a BOM, which can be problematic for files read by tools on Unix-like platforms / tools with a Unix heritage, which are often not equipped to deal with such a BOM.

        • See the answers to this question for how to create BOM-less UTF-8 files in Windows PowerShell.
    • As with reading, the -Encoding parameter of file-writing cmdlets allows you to specify the output encoding explicitly:

      • Note that in PowerShell [Core] v6+, where BOM-less UTF-8 is the default, -Encoding UTF8 also refers to the BOM-less variant (unlike in Windows PowerShell); in PowerShell [Core] you must use -Encoding UTF8BOM in order to create a file with a BOM.

      • Curiously, as of PowerShell [Core] v7.0, there is no -Encoding value for the system's active ANSI code page, i.e. for Windows PowerShell's default (in Windows PowerShell, -Encoding Default explicitly requests ANSI encoding, but in PowerShell [Core] this refers to BOM-less UTF-8). This problematic omission is discussed in this GitHub issue. By contrast, targeting the active OEM code page with -Encoding OEM still works.

      • In order to create UTF-32BE files, Windows PowerShell requires identifier BigEndianUtf32; due to a bug in PowerShell [Core] as of v7.0, this identifier isn't supported, but you can use UTF-32BE instead.

      • Windows PowerShell is limited to the encodings listed in the Microsoft.PowerShell.Commands.FileSystemCmdletProviderEncoding enumeration, but PowerShell [Core] allows you to pass any of the supported .NET encodings to the -Encoding parameter, either by code-page number (e.g., 1252) or by encoding name (e.g., windows-1252); see the sketch after this list. [Text.Encoding]::GetEncodings().CodePage and [Text.Encoding]::GetEncodings().Name enumerate them in principle, but due to a lack of .NET Core API support as of v7.0 this enumeration lists only a small subset of the actually supported encodings; running these commands in Windows PowerShell shows them all.

      • You can create UTF-7 files (UTF7), but they won't have a BOM; even input files that do have one aren't automatically recognized on reading, so specifying -Encoding UTF7 is always necessary for reading UTF-7 files.
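
To make that last point concrete, here is a small sketch (PowerShell [Core] v7 assumed; legacy.txt is a hypothetical Windows-1252-encoded file):

Get-Content -Encoding windows-1252 .\legacy.txt                           # by encoding name
Get-Content -Encoding ([Text.Encoding]::GetEncoding(1252)) .\legacy.txt   # as an Encoding object
# Per the note above, a code-page number also works: Get-Content -Encoding 1252 .\legacy.txt

# Enumerate the encodings .NET reports (only a subset under .NET Core as of v7.0, as noted):
[Text.Encoding]::GetEncodings() | Select-Object CodePage, Name | Sort-Object CodePage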

In short:

  • In PowerShell, you have to know an input file's encoding in order to match that encoding on writing, and specify that encoding explicitly via the -Encoding parameter (if it differs from the default).

  • Get-Content (without -Encoding) provides no information as to what encoding it detected via a BOM or which one it assumed in the absence of a BOM.

  • If needed, you can perform your own analysis of the opening bytes of a text file to look for a BOM (see the sketch below), but note that in the absence of one you'll have to rely on heuristics to infer the encoding - that is, you can make a reasonable guess, but you cannot be certain.
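
A minimal sketch of such a BOM check (the function name Get-FileBom is made up for illustration; it covers only the BOMs listed further above and uses PowerShell [Core] syntax):

function Get-FileBom {
  param([string]$Path)
  # First 4 bytes are enough to distinguish the BOMs; in Windows PowerShell,
  # replace -AsByteStream with -Encoding Byte.
  [byte[]]$b = Get-Content -LiteralPath $Path -AsByteStream -TotalCount 4
  if ($b.Count -ge 4 -and $b[0] -eq 0xFF -and $b[1] -eq 0xFE -and $b[2] -eq 0 -and $b[3] -eq 0) { return 'UTF-32LE' }
  if ($b.Count -ge 4 -and $b[0] -eq 0 -and $b[1] -eq 0 -and $b[2] -eq 0xFE -and $b[3] -eq 0xFF) { return 'UTF-32BE' }
  if ($b.Count -ge 3 -and $b[0] -eq 0xEF -and $b[1] -eq 0xBB -and $b[2] -eq 0xBF) { return 'UTF-8' }
  if ($b.Count -ge 2 -and $b[0] -eq 0xFF -and $b[1] -eq 0xFE) { return 'UTF-16LE' }
  if ($b.Count -ge 2 -and $b[0] -eq 0xFE -and $b[1] -eq 0xFF) { return 'UTF-16BE' }
  'no BOM - the encoding has to be guessed'
}

Get-FileBom C:\Users\me\Desktop\savedDataframe.csv   # the question's input file would report: no BOM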

Also note that PowerShell, as of v7, fundamentally lacks support for passing raw byte streams through the pipeline - see this answer.


Your particular case:

Your problem was that your input file was UTF-8-encoded, but didn't have a BOM (which is actually preferable for the widest compatibility).

Since you're using Windows PowerShell, which misinterprets such files as ANSI-encoded, you need to tell it to read the file as UTF-8 with -Encoding Utf8.

As stated, on writing, -Encoding Utf8 inevitably creates a file with a BOM in Windows PowerShell; if that is a concern, use the .NET framework directly to produce BOM-less files, as shown in the answers to this question.
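
By way of illustration, here is a sketch of that .NET-based approach applied to the original splitting command (the input and output paths come from the question; the rest is an assumption, not necessarily the exact code from the linked answers):

$utf8NoBom = New-Object System.Text.UTF8Encoding $false   # $false = do not emit a BOM
$i = 0
Get-Content -Encoding UTF8 C:\Users\me\Desktop\savedDataframe.csv -ReadCount 250000 |
  ForEach-Object {
    $i++
    # WriteAllLines takes an absolute target path, the array of lines, and the encoding.
    [System.IO.File]::WriteAllLines("C:\Users\me\Desktop\Processed\splitfile_$i.csv", $_, $utf8NoBom)
  }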

Note that you would have had no problem with your original command in PowerShell [Core] v6+ - it defaults to BOM-less UTF-8 both on reading and writing, across all cmdlets.

This sensible, standardized default alone is a good reason for considering the move to PowerShell v7.0, which aims to be a superior replacement for the legacy Windows PowerShell.

mklement0