
I've been trying to track down an odd encoding issue with artifacts coming out of GitLab.

One XML file was going in as UTF-8 and coming out as UCS-2 LE BOM. After a stack of testing, I'm genuinely shocked to discover it's PowerShell doing the damage.

The PowerShell script is even running on a Windows box! I have this code in a script:

function Update-SourceDataFileVersion
{
  Param ([string]$Version)

  foreach ($o in $input) 
  {
    Write-output $o.FullName 
    $TmpFile = $o.FullName + ".tmp" 

     get-content $o.FullName | 
        %{$_ -replace 'x.x.x.x', $Version } > $TmpFile

     move-item $TmpFile $o.FullName -force
  }
}

I know I need to specify an encoding. From looking at other answers on SO I should be able to do this, but I just cannot find the right syntax.

I've tried:

function Update-SourceDataFileVersion
{
  Param ([string]$Version)

  foreach ($o in $input) 
  {
    Write-output -Encoding utf8 $o.FullName 
    $TmpFile = $o.FullName + ".tmp" 

     get-content -Encoding utf8 $o.FullName | 
        %{$_ -replace 'x.x.x.x', $Version } > $TmpFile -Encoding utf8

     move-item $TmpFile $o.FullName -force
  }
}

That follows the other examples, but it just results in empty files.

How can I stop PowerShell from breaking my files and make it write the right encoding? I'm running PS 5.1.

Jammer
  • _Windows PowerShell_ is, unfortunately, wildly inconsistent with respect to default character encodings, unlike _PowerShell (Core) 7+_, which now consistently defaults to BOM-less UTF-8. Note that while executing `$PSDefaultParameterValues['*:Encoding'] = 'utf8'` first _can_ make Windows PowerShell v5.1's `>` operator produce UTF-8 files, they will invariably have a _BOM_ - see [this answer](https://stackoverflow.com/a/40098904/45375). – mklement0 Feb 05 '21 at 21:02
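
To confirm which encoding a file actually ended up with, one option is to look at its first few bytes. The following is a minimal sketch, not part of the original question: `Get-FileBom` is a hypothetical helper name, and it only recognises the UTF-8 and UTF-16 byte-order marks ("UCS-2 LE BOM" in some editors corresponds to the UTF-16 LE mark `FF FE`).

```powershell
# Hypothetical helper: report which byte-order mark (if any) a file starts with.
# Pass a full path: .NET file APIs resolve relative paths against the process
# working directory, which may differ from the current PowerShell location.
function Get-FileBom {
    param([Parameter(Mandatory)][string]$Path)

    # Read the raw bytes via .NET so no PowerShell text decoding is involved.
    $bytes = [System.IO.File]::ReadAllBytes($Path)

    if ($bytes.Length -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
        return 'UTF-8 with BOM'
    }
    # Note: a UTF-32 LE BOM also starts with FF FE; this sketch does not distinguish it.
    if ($bytes.Length -ge 2 -and $bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE) {
        return 'UTF-16 LE with BOM'
    }
    if ($bytes.Length -ge 2 -and $bytes[0] -eq 0xFE -and $bytes[1] -eq 0xFF) {
        return 'UTF-16 BE with BOM'
    }
    return 'No BOM'
}
```

Running it against a file before and after the script touches it (for example `Get-FileBom -Path 'C:\full\path\to\file.xml'`, where the path is a placeholder) should show whether the rewrite step is where the encoding changes.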

1 Answer


In your example you are using the redirection operator > to save the output to a file. Because > is an operator rather than a cmdlet, it doesn't accept parameters, so adding -Encoding utf8 after it doesn't set the encoding of the file.

Instead, you want to use the Out-File cmdlet:

function Update-SourceDataFileVersion
{
  Param ([string]$Version)

  foreach ($o in $input) 
  {
    $TmpFile = $o.FullName + ".tmp" 

     get-content -Encoding utf8 $o.FullName |
        %{$_ -replace 'x.x.x.x', $Version } |
        Out-File -FilePath $TmpFile -Encoding utf8

     move-item $TmpFile $o.FullName -force
  }
}

BTW: I think you are using Write-Output in the wrong way: it passes an object along the pipeline, it does not write to a file. If you want to log the file name, you should use Write-Host instead.
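
If the file has to stay BOM-less (the comments note that Windows PowerShell 5.1 always adds a BOM with `-Encoding utf8`), one possible workaround is to go through the .NET file APIs with an explicit `UTF8Encoding` instance. This is only a sketch under that assumption: the function name `Update-SourceDataFileVersionNoBom` and its `-Path` parameter are made up for illustration, and it rewrites the file in place instead of using a temporary file.

```powershell
# Hypothetical variant of the function above that writes UTF-8 without a BOM.
function Update-SourceDataFileVersionNoBom {
    param(
        [Parameter(Mandatory)][string]$Path,    # full path to the file to update
        [Parameter(Mandatory)][string]$Version
    )

    # $false = do not emit a byte-order mark when writing.
    $utf8NoBom = [System.Text.UTF8Encoding]::new($false)

    # Read every line as UTF-8 and apply the same -replace as the original.
    # Note that -replace treats 'x.x.x.x' as a regular expression.
    # @(...) keeps the result an array even for zero or one line.
    $lines = @(
        [System.IO.File]::ReadAllLines($Path, [System.Text.Encoding]::UTF8) |
            ForEach-Object { $_ -replace 'x.x.x.x', $Version }
    )

    # The whole file is already in memory, so it can be overwritten directly.
    [System.IO.File]::WriteAllLines($Path, [string[]]$lines, $utf8NoBom)
}
```

It could be called per file, for example `Get-ChildItem *.xml | ForEach-Object { Update-SourceDataFileVersionNoBom -Path $_.FullName -Version '1.2.3.4' }`, where the file filter and version number are placeholders.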

Francesco Montesano
  • Good grief. I don't touch PS much, and there are so many gotchas. I found this works too: ```$OutputEncoding = [System.Console]::OutputEncoding = [System.Console]::InputEncoding = [System.Text.Encoding]::UTF8; $PSDefaultParameterValues['*:Encoding'] = 'utf8'``` – Jammer Feb 05 '21 at 20:35
  • This still raises the question of why PS 5.1, on the latest Windows 10, reads a UTF-8 file and outputs a bizarre UCS-2 LE BOM file by default. That is just plain nuts. – Jammer Feb 05 '21 at 20:36
  • Because "Windows"! I think the problem there is not PowerShell, but the Windows default encodings. I have had enough encoding issues, both with PowerShell and Java, that could easily be solved by explicitly using UTF-8. As a Linux guy, I've spent way too many hours of my life debugging such problems. – Francesco Montesano Feb 05 '21 at 20:38
  • It's really not obvious. The docs say ">" is a shortcut for a cmdlet, but they don't then tell you the key caveat that you can't use options with it, and that just leads to confusion. Thanks for your help! – Jammer Feb 05 '21 at 20:40
  • You're welcome. Regarding the shortcuts: I'm making a habit of not using them. It gets a bit more wordy, but it saves you from some issues. – Francesco Montesano Feb 05 '21 at 20:47
  • _Windows PowerShell_ is, unfortunately, wildly inconsistent with respect to default character encodings, unlike _PowerShell (Core) 7+_, which now consistently defaults to BOM-less UTF-8 - see [this answer](https://stackoverflow.com/a/40098904/45375). – mklement0 Feb 05 '21 at 21:00
  • As an aside: If the input objects are _strings_, it is faster to use [`Set-Content`](https://docs.microsoft.com/powershell/module/microsoft.powershell.management/set-content) instead of [`Out-File`](https://docs.microsoft.com/powershell/module/microsoft.powershell.utility/out-file) / `>`, which may matter when writing large files - see [this answer](https://stackoverflow.com/a/60158247/45375). Note: _Windows PowerShell_'s `Set-Content` defaults to the active ANSI code page. – mklement0 Feb 05 '21 at 21:21
  • I did find your answer before posting this, but I just couldn't get the syntax right. The encoding mess is, let's face it, a sick joke. It's not remotely funny. – Jammer Feb 05 '21 at 21:32
  • I mean, I wasted a day on this thinking it was GitLab ... having an issue like this with PowerShell didn't even enter my realms of possibility. – Jammer Feb 05 '21 at 21:34
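
The `Set-Content` suggestion from the comments could look roughly like the sketch below. The function name and the `-Path` parameter are hypothetical; the encoding is still passed explicitly because, as noted above, Windows PowerShell's `Set-Content` would otherwise default to the ANSI code page. In Windows PowerShell 5.1, `-Encoding UTF8` is expected to write a BOM, while PowerShell 7+ treats it as BOM-less UTF-8.

```powershell
# Hypothetical Set-Content variant of the answer's function (one file per call).
function Update-SourceDataFileVersionSetContent {
    param(
        [Parameter(Mandatory)][string]$Path,
        [Parameter(Mandatory)][string]$Version
    )

    $TmpFile = $Path + '.tmp'

    # Read as UTF-8, replace the placeholder, and write the strings back out
    # with an explicit encoding instead of relying on the default.
    Get-Content -LiteralPath $Path -Encoding UTF8 |
        ForEach-Object { $_ -replace 'x.x.x.x', $Version } |
        Set-Content -LiteralPath $TmpFile -Encoding UTF8

    # Swap the temporary file in for the original, as in the answer.
    Move-Item -LiteralPath $TmpFile -Destination $Path -Force
}
```

As in the original, the temporary file avoids writing to the file while it is still being read by the same pipeline.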