This is a three-part question, all related. The context is this: I need to do find and replace in arbitrary files, which can have varied encodings and can be VERY large (upwards of 500 MB). My current code uses Get-Content and Add-Content, and it is both a memory hog and horribly slow on large files.

I have code that successfully identifies the source file encoding using

$reader = [IO.StreamReader]::New($sourceFile, $true)
$reader.Peek() > $null
$encoding = $reader.CurrentEncoding

The result is in the form System.Text.UTF8Encoding

1: I believe a Constructor in this form is what I need for the StreamWriter

$writer = [IO.StreamWriter]::New($tempFile, Encoding.UTF8)

I am hoping there is a built-in programmatic way to get from the string System.Text.UTF8Encoding to the correct form needed for the constructor. Or do I need to make my own hash table or find/replace to handle this?
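For what it's worth, `CurrentEncoding` returns a `System.Text.Encoding` object that merely prints as its type name, so one option is to hand it to the writer unchanged. A minimal sketch of that idea, assuming the three-argument `StreamWriter(string, bool, Encoding)` overload; the two temp files and the sample text are made up for the illustration:

```powershell
# Hypothetical scratch files, created only for this illustration
$sourceFile = [IO.Path]::GetTempFileName()
$tempFile   = [IO.Path]::GetTempFileName()

# Write a small UTF-32 sample (with BOM) so there is something to detect
[IO.File]::WriteAllText($sourceFile, "hello`r`nworld", [Text.Encoding]::UTF32)

$reader = [IO.StreamReader]::new($sourceFile, $true)
$reader.Peek() > $null                # force the BOM detection to run
$encoding = $reader.CurrentEncoding   # an Encoding object, not a string

# Overload used: StreamWriter(string path, bool append, Encoding encoding)
$writer = [IO.StreamWriter]::new($tempFile, $false, $encoding)
try {
    # ReadToEnd is fine for this tiny sample, not for a 500 MB file
    $writer.Write($reader.ReadToEnd())
}
finally {
    $writer.Dispose()
    $reader.Dispose()
}

$encoding -is [Text.Encoding]   # shows whether this is an object rather than a string
$encoding.WebName               # short name of the detected encoding
```

If this holds up on real files, no hash table or string mapping is needed, because the detected object already carries its own byte order and BOM settings.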

2: I have used Encoding.UTF16 in a test, but the resultant file is still UTF8. I recently was educated on the subtlety of StreamReader not populating .CurrentEncoding until some kind of Read happens (thus the .Peek() in the code above) and I wonder if there is some similar issue with StreamWriter?

3: My understanding is that some encodings need an end of file written, but some don't. For example, when writing a UTF8 file, the file sizes don't match until $writer.WriteLine("rn") is included as the last WriteLine. Can anyone point me to a reference that spells out which encodings need what endings?
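One thing that may be worth ruling out when file sizes differ is the byte order mark (preamble) written at the start of the file, which depends on how the encoding object was created. This is only a sketch for inspecting that, not an answer to the end-of-file question; the labels in the table are made up for display:

```powershell
# Compare the preamble (BOM) length of several encoding instances
$encodings = [ordered]@{
    'New-Object UTF8Encoding'    = New-Object -TypeName System.Text.UTF8Encoding
    '[Text.Encoding]::UTF8'      = [Text.Encoding]::UTF8
    'New-Object UTF32Encoding'   = New-Object -TypeName System.Text.UTF32Encoding
    'New-Object UnicodeEncoding' = New-Object -TypeName System.Text.UnicodeEncoding
    'New-Object ASCIIEncoding'   = New-Object -TypeName System.Text.ASCIIEncoding
}

foreach ($entry in $encodings.GetEnumerator()) {
    # GetPreamble() returns the bytes a writer emits at the start of a new file
    '{0,-28} preamble bytes: {1}' -f $entry.Key, $entry.Value.GetPreamble().Length
}
```

A UTF-8 encoding created with the parameterless constructor has an empty preamble, while `[Text.Encoding]::UTF8` has a three-byte one, so a copy written with the former can come out three bytes shorter than a source that started with a BOM.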

Thanks!

UPDATE: I found this thread, which got me started. And, when $reader.CurrentEncoding is System.Text.UTF8Encoding then $encoding = New-Object $reader.CurrentEncoding $False works, but when $reader.CurrentEncoding is System.Text.UTF32Encoding I get an error about

Cannot find an overload for "UTF32Encoding"

Which seems odd, because this can't be a typo or something, I am getting the string programmatically from the file.

This suggests that all of these should work, I think. But only the first works. All the rest throw the Overload error.

$encoding = New-Object System.Text.UTF8Encoding $False
$encoding = New-Object System.Text.UTF32Encoding $False
$encoding = New-Object System.Text.ASCIIEncoding $False
$encoding = New-Object System.Text.UNICODEEncoding $False
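A quick way to see which constructor arguments each of these classes accepts is to list the constructors by reflection. A small sketch, relying on the fact that the second positional argument of `New-Object` is bound to `-ArgumentList` and therefore passed to the constructor:

```powershell
# List the public constructor signatures of each encoding class
$types = [Text.UTF8Encoding], [Text.UTF32Encoding], [Text.ASCIIEncoding], [Text.UnicodeEncoding]

foreach ($type in $types) {
    foreach ($ctor in $type.GetConstructors()) {
        # Join the parameter type names, e.g. "Boolean, Boolean"
        $signature = ($ctor.GetParameters() | ForEach-Object { $_.ParameterType.Name }) -join ', '
        '{0}({1})' -f $type.Name, $signature
    }
}
```

Only a class that lists a constructor taking exactly one Boolean can be created with a lone `$False` argument.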

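PROGRESS: So, it seems that the $False in those New-Object calls is not a -strict argument at all (that switch only applies with -comObject). It binds positionally to -argumentList and is passed to the class constructor. System.Text.UTF8Encoding has a constructor that takes a single Boolean (whether to emit the byte order mark); UTF32Encoding, ASCIIEncoding and UnicodeEncoding do not, hence the overload error. All of them work with just the -typeName argument and no constructor arguments. So...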

$encoding = New-Object -typeName:System.Text.UTF8Encoding
$encoding = New-Object -typeName:System.Text.UTF32Encoding
$encoding = New-Object -typeName:System.Text.ASCIIEncoding
$encoding = New-Object -typeName:System.Text.UNICODEEncoding

works. Also, I changed the StreamWriter Constructor to this

$writer = [IO.StreamWriter]::New($destinationFile, $true, (New-Object $reader.CurrentEncoding))

That boolean is the append flag: $true appends to the destination file if it already exists, and $false overwrites it. And... at this point I think I have a working function!
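In the `StreamWriter(string, bool, Encoding)` overload the second parameter is documented as `append`. A small sketch that shows its effect, using a made-up scratch file and a BOM-less UTF-8 encoding so that only the text ends up in the file:

```powershell
$path = [IO.Path]::GetTempFileName()   # hypothetical scratch file
$utf8NoBom = New-Object -TypeName System.Text.UTF8Encoding

# append = $true: each writer adds to what is already in the file
foreach ($text in 'one', 'two') {
    $writer = [IO.StreamWriter]::new($path, $true, $utf8NoBom)
    $writer.Write($text)
    $writer.Dispose()
}
$appended = [IO.File]::ReadAllText($path)

# append = $false: each writer starts the file over
foreach ($text in 'one', 'two') {
    $writer = [IO.StreamWriter]::new($path, $false, $utf8NoBom)
    $writer.Write($text)
    $writer.Dispose()
}
$overwritten = [IO.File]::ReadAllText($path)

"appended:    $appended"
"overwritten: $overwritten"
```

For a destination file that does not exist yet the two settings behave the same, which may be why `$true` appears to work in a find-and-replace that writes to a fresh temp file.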

Gordon
  • Does `$writer = [IO.StreamWriter]::New($tempFile, $reader.CurrentEncoding)` work? or `$reader.CurrentEncoding.GetEncoder()` ? – gvee Jan 08 '18 at 14:46
  • What encoding are you aiming for? The carriage return + newline is a Windows thing. In PowerShell, you should use the proper escapes (`\`r\`n`) – Maximilian Burszley Jan 08 '18 at 14:49
  • @gvee `$reader.CurrentEncoding` definitely doesn't work, as that produces a string, not the required object. `$reader.CurrentEncoding.GetEncoder()` doesn't throw an error, but it produces a UTF8 file, even when `$reader.CurrentEncoding` = *System.Text.UTF32Encoding*. And as I added in the original post, I am seeing errors with a different approach and *System.Text.UTF32Encoding* when *System.Text.UTF8Encoding* works fine. – Gordon Jan 08 '18 at 15:02
  • @TheIncorrigable1 Sorry, I didn't notice that my cut n paste didn't copy correctly. I am indeed using ``r`n`. As for encoding, I want to support any valid Windows encoding, since I am working with unknown files. Is this something I can/should add for any encoding, or am I correct it's only needed for some? – Gordon Jan 08 '18 at 15:05
  • And I still didn't get good formatting. I guess I need more escapes to get it to format right in a comment. – Gordon Jan 08 '18 at 15:14

1 Answer

So, to compress all that into a more succinct answer...

$reader = [IO.StreamReader]::New($sourceFile, $true)
$reader.Peek() > $null
$writer = [IO.StreamWriter]::New($destinationFile, $true, (New-Object -typeName:$reader.CurrentEncoding))

This works a treat.
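Putting the pieces together, a streaming find and replace could look roughly like the sketch below. The function name `Copy-WithReplace` and its parameters are made up for illustration, it does a literal (non-regex) replace line by line, and it passes the detected encoding object straight to the writer instead of recreating it with New-Object:

```powershell
function Copy-WithReplace {
    param(
        [string]$Path,          # source file (hypothetical parameter names)
        [string]$Destination,   # file to write, overwritten if present
        [string]$Find,          # literal text to look for
        [string]$Replace        # literal replacement text
    )

    $reader = [IO.StreamReader]::new($Path, $true)
    try {
        $reader.Peek() > $null   # make sure CurrentEncoding is populated

        # append = $false, encoding taken from what the reader detected
        $writer = [IO.StreamWriter]::new($Destination, $false, $reader.CurrentEncoding)
        try {
            # Only one line is held in memory at a time
            while ($null -ne ($line = $reader.ReadLine())) {
                $writer.WriteLine($line.Replace($Find, $Replace))
            }
        }
        finally { $writer.Dispose() }
    }
    finally { $reader.Dispose() }
}

# Example call with made-up paths:
# Copy-WithReplace -Path 'C:\temp\in.txt' -Destination 'C:\temp\out.txt' -Find 'foo' -Replace 'bar'
```

Two caveats under these assumptions: `WriteLine` always terminates the last line and uses the platform newline, so the output can differ in size from a source that lacked a final newline or used other line endings; and for a source without a BOM the reader falls back to its default UTF-8 encoding object, so the output may gain a BOM.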

The question of which encodings need special end of file treatment is still open.

Gordon