This is a three-part question, all related. The context is this: I need to do find and replace in arbitrary files, which can have varied encodings and can be VERY large (upwards of 500 MB). My current code uses Get-Content and Add-Content, and it's both a memory hog and horribly slow on large files.
I have code that successfully identifies the source file encoding using
$reader = [IO.StreamReader]::New($sourceFile, $true)
$reader.Peek() > $null
$encoding = $reader.CurrentEncoding
The result is in the form System.Text.UTF8Encoding
1: I believe a Constructor in this form is what I need for the StreamWriter
$writer = [IO.StreamWriter]::New($tempFile, Encoding.UTF8)
I am hoping there is a built-in programmatic way to get from the string System.Text.UTF8Encoding to the correct form needed for the constructor. Or do I need to make my own hash table or find/replace to handle this?
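One possible shortcut, sketched under the assumption that the detected encoding is also the one wanted for output: $reader.CurrentEncoding is already a System.Text.Encoding object rather than a string, so it can be handed to the writer as is, with no lookup table. The file names below are hypothetical placeholders. Note that a file without a BOM is reported as UTF-8 by this reader, and writing with that object may add a BOM the source did not have.

```powershell
# Hypothetical paths, for illustration only.
$sourceFile = Join-Path ([IO.Path]::GetTempPath()) 'encoding-demo-source.txt'
$tempFile   = Join-Path ([IO.Path]::GetTempPath()) 'encoding-demo-copy.txt'

# Create a small UTF-16 LE sample file (with BOM) so there is something to detect.
[IO.File]::WriteAllText($sourceFile, "hello`r`n", [Text.Encoding]::Unicode)

$reader = [IO.StreamReader]::new($sourceFile, $true)
$reader.Peek() > $null                 # force BOM detection
$encoding = $reader.CurrentEncoding    # an Encoding object, not a string

# StreamWriter(path, append, encoding): $false overwrites any existing file.
$writer = [IO.StreamWriter]::new($tempFile, $false, $encoding)
try {
    $writer.Write($reader.ReadToEnd())
}
finally {
    $writer.Dispose()
    $reader.Dispose()
}
```

ReadToEnd is used here only because the sample is tiny; for a 500 MB file the content would need to be read in pieces.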
2: I have used Encoding.UTF16 in a test, but the resultant file is still UTF8. I was recently educated on the subtlety of StreamReader not populating .CurrentEncoding until some kind of read happens (thus the .Peek() in the code above), and I wonder if there is a similar issue with StreamWriter?
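A minimal sketch for checking what a StreamWriter actually put on disk, assuming a hypothetical output path. It uses [Text.Encoding]::Unicode, which is how PowerShell spells the static UTF-16 LE property; as far as I know System.Text.Encoding has no property named UTF16. The writer buffers its output, so the file should only be inspected after the writer has been flushed or disposed.

```powershell
# Hypothetical output path, for illustration only.
$path = Join-Path ([IO.Path]::GetTempPath()) 'bom-check-demo.txt'

# $false = overwrite; appending to an existing file would keep its old content and BOM.
$writer = [IO.StreamWriter]::new($path, $false, [Text.Encoding]::Unicode)
$writer.Write('abc')
$writer.Dispose()   # flushes the buffer to disk

# Show the first two bytes as hex to see which BOM, if any, was written.
$bytes = [IO.File]::ReadAllBytes($path)
$hex = ($bytes[0..1] | ForEach-Object { $_.ToString('X2') }) -join ' '
$hex
```

For UTF-16 LE this prints FF FE. A UTF-8 BOM would show up as EF BB BF in the first three bytes instead.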
3: My understanding is that some encodings need an end of file written, but some don't. For example, when writing a UTF8 file, the file sizes don't match until $writer.WriteLine("`r`n") is included as the last WriteLine. Can anyone point me to a reference that spells out which encodings need what endings?
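For what it's worth, none of the common text encodings (ASCII, UTF-8, UTF-16, UTF-32) define an end-of-file marker, so a size mismatch is more likely to come from line handling: ReadLine strips each line terminator and WriteLine adds the platform one back, including after the last line. The sketch below, with hypothetical paths, copies in character blocks instead, which leaves line endings exactly as they were. It is a copy only, not a find and replace.

```powershell
# Hypothetical paths, for illustration only.
$source = Join-Path ([IO.Path]::GetTempPath()) 'block-copy-source.txt'
$dest   = Join-Path ([IO.Path]::GetTempPath()) 'block-copy-dest.txt'

# Sample input: LF line endings, no final newline, UTF-8 without BOM.
$utf8NoBom = [Text.UTF8Encoding]::new($false)
[IO.File]::WriteAllText($source, "line one`nline two", $utf8NoBom)

$reader = [IO.StreamReader]::new($source, $true)
$writer = [IO.StreamWriter]::new($dest, $false, $utf8NoBom)
$buffer = [char[]]::new(4096)
try {
    # Read(char[], index, count) returns 0 at end of stream.
    while (($count = $reader.Read($buffer, 0, $buffer.Length)) -gt 0) {
        $writer.Write($buffer, 0, $count)
    }
}
finally {
    $writer.Dispose()
    $reader.Dispose()
}

# Compare sizes once both streams are closed.
$sameSize = (Get-Item -LiteralPath $source).Length -eq (Get-Item -LiteralPath $dest).Length
$sameSize
```

The trade-off is that a search string could straddle two blocks, so a block-based find and replace needs extra care at the boundaries.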
Thanks!
UPDATE:
I found this thread, which got me started. When $reader.CurrentEncoding is System.Text.UTF8Encoding, then $encoding = New-Object $reader.CurrentEncoding $False works, but when $reader.CurrentEncoding is System.Text.UTF32Encoding I get an error about
Cannot find an overload for "UTF32Encoding"
which seems odd, because this can't be a typo or something; I am getting the string programmatically from the file.
This suggests that all of these should work, I think. But only the first works. All the rest throw the Overload error.
$encoding = New-Object System.Text.UTF8Encoding $False
$encoding = New-Object System.Text.UTF32Encoding $False
$encoding = New-Object System.Text.ASCIIEncoding $False
$encoding = New-Object System.Text.UNICODEEncoding $False
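A quick way to see which constructor signatures each of these classes exposes, which may help with the overload error: a second positional value given to New-Object is bound to -ArgumentList, so it has to match one of the signatures listed. This is only a reflection sketch; the variable names are illustrative.

```powershell
$types = [Text.UTF8Encoding], [Text.UTF32Encoding], [Text.ASCIIEncoding], [Text.UnicodeEncoding]

$report = foreach ($type in $types) {
    # Build one "(Type, Type)" string per public constructor.
    $signatures = foreach ($ctor in $type.GetConstructors()) {
        $paramTypes = $ctor.GetParameters() | ForEach-Object { $_.ParameterType.Name }
        '(' + ($paramTypes -join ', ') + ')'
    }
    '{0}: {1}' -f $type.Name, ($signatures -join ' ')
}
$report
```

In the list this prints, only UTF8Encoding should show a constructor taking exactly one Boolean.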
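PROGRESS: So, it seems that the $false in those New-Object calls is bound positionally to -ArgumentList and handed to the constructor. Only System.Text.UTF8Encoding has a constructor that takes a single Boolean (whether to emit a BOM); the other three classes have no one-Boolean constructor, hence the overload error. All of them work with the -typeName argument alone, which calls the parameterless constructor. So...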
$encoding = New-Object -typeName:System.Text.UTF8Encoding
$encoding = New-Object -typeName:System.Text.UTF32Encoding
$encoding = New-Object -typeName:System.Text.ASCIIEncoding
$encoding = New-Object -typeName:System.Text.UNICODEEncoding
works. Also, I changed the StreamWriter Constructor to this
$writer = [IO.StreamWriter]::New($destinationFile, $true, (New-Object $reader.CurrentEncoding))
That boolean is the append flag: $true appends to an existing file instead of overwriting it. And... at this point I think I have a working function!
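For anyone landing here, a rough sketch of how the pieces could fit together as a streaming find and replace. The function name Update-FileText and its parameters are hypothetical, not taken from my actual code. It assumes a literal (non-regex) search, works line by line so matches spanning a line break are not found, rewrites line endings to the platform default, and always ends the file with a newline.

```powershell
function Update-FileText {
    param(
        [Parameter(Mandatory)] [string] $Path,
        [Parameter(Mandatory)] [string] $Find,
        [Parameter(Mandatory)] [AllowEmptyString()] [string] $Replace
    )

    # .NET resolves relative paths against the process directory, so use a full path.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $tempPath = "$fullPath.tmp"

    $reader = [IO.StreamReader]::new($fullPath, $true)
    $writer = $null
    $succeeded = $false
    try {
        $reader.Peek() > $null   # populate CurrentEncoding from the BOM, if any
        # $false = overwrite the temp file rather than append to it.
        $writer = [IO.StreamWriter]::new($tempPath, $false, $reader.CurrentEncoding)
        while ($null -ne ($line = $reader.ReadLine())) {
            $writer.WriteLine($line.Replace($Find, $Replace))
        }
        $succeeded = $true
    }
    finally {
        if ($writer) { $writer.Dispose() }
        $reader.Dispose()
    }

    # Swap the temp file in only if every line was processed.
    if ($succeeded) {
        Move-Item -LiteralPath $tempPath -Destination $fullPath -Force
    }
}

# Usage with a hypothetical UTF-16 sample file.
$demo = Join-Path ([IO.Path]::GetTempPath()) 'replace-demo.txt'
[IO.File]::WriteAllText($demo, "foo bar`r`nfoo", [Text.Encoding]::Unicode)
Update-FileText -Path $demo -Find 'foo' -Replace 'baz'
```

Only one line is held in memory at a time, which is the point for very large files; a file that is a single enormous line would still be read in one piece.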