
This question is related to another one that took the Perl route but ran into many difficulties due to Windows bugs (see "Perl or Powershell how to convert from UCS-2 little endian to utf-8 or do inline oneliner search replace regex on UCS-2 file").

I would like the PowerShell equivalent of a simple Perl regex on a little-endian UCS-2 file (UCS-2LE is, for characters in the Basic Multilingual Plane, the same as UTF-16 little endian), i.e.:

perl -pi.bak -e 's/search/replace/g;' MyUCS-2LEfile.txt

You will probably need to tell PowerShell (gci) that the input file is UCS-2LE and that the output file should be in the same UCS-2LE (Windows CR LF) format, etc.

htfree
  • @lotpings thanks for edit i finally noticed i'm supposed to "indent" code, will try it next time. Can we see who downvotes and for what reason? – htfree May 07 '19 at 22:22
  • Your other question, like this one, leaves open which shell/cmd you actually used. If the file `MyUCS-2LEfile.txt` is UTF16LE encoded with a BOM, simply do `(Get-Content MyUCS-2LEfile.txt) | Set-Content MyUTF8file.txt -Encoding UTF8` –  May 07 '19 at 22:24
  • pretty sure no BOM and think i tried similar as you posted and won't work right. I was using cmd on Windows i thought it was clear, and using windows activeperl. I would remove Powershell from previous question and let this one be powershell.You can see my last comments there, windows cmd does not even seem to allow chcp 1200 which is utf-16le/ucs-2le i referenced links claiming that powershell on the contrary does supposedly handle ucs2-le .. – htfree May 07 '19 at 22:26
  • The `Get-Content` as well as `Set-Content` / `Out-File` cmdlets allow the `-Encoding` parameter. No want of a regex… – JosefZ May 07 '19 at 22:26
  • @JosefZ please give complete gci oneliner, i'm almost zero with powershell so afraid might mess up some small syntax, tx. Oh and i had tried powershell -Command "(Get-Content C:\dir1\dir2\test.txt).replace('foo', 'bar') | Set-Content -encoding utf8 C:\dir1\dir2\test.txt" which does not work in converting file properly. But the goal is to edit the ucs-2le file and result saved as ucs-2le – htfree May 07 '19 at 22:28
  • This is not a guessing game, use `Get-Content MyUCS-2LEfile.txt -Head 10 | Format-Hex` to ***see*** what is the matter. The `-Encoding` parameter is the same for all the cmdlets. If the input file **is** UTF16LE encoded, try `-Encoding unicode` when reading and `Set-Content -Encoding UTF8` when saving. BTW don't mix up the aliases, **look them up** with `Get-Alias` –  May 07 '19 at 23:01
  • To be Unicode compliant you have to cope with or without the BOM. Seems your language is not Unicode compliant. – Noodles May 08 '19 at 01:00
  • I am "beyond" exhausted for many reasons but I finally solved my main problem with perl. Though if someone provides a complete powershell command that I can paste on the dos cmd to replace foo with bar in a ucs-2le file I will test/confirm it. – htfree May 08 '19 at 02:13
  • You don't seem to understand anything about windows. I have a RICHEDIT control. I don't have to tell it what I'm opening, it knows (ANSI, LE, BE, UTF8). Nothing in your question addresses Windows' issues as you don't mention any. Yet you **boldly** claim there are bugs in Windows. Yet your architecture is not professional. **Hint** Open the file correctly. – Noodles May 08 '19 at 09:01
  • @Noodles Sorry I should have clarified I used the term windows bugs very "loosely" referring to bugs encountered while on windows dealing with ucs2le. See some of my links whether here or in perl linked question, chcp 1200 for example is not supported except in managed apps etc so whether complication arises out of operating system shortcoming or tools/utils i used I don't know. But note that i had issues with powershell ucs2le manipulation without specifying format though i'm no powershell expert as already said. – htfree May 08 '19 at 21:33
  • @Noodles maybe you can help me out a bit with powershell command that does not add a BOM to utf-16le/ucs-2le? I managed preventing BOM if output to utf-8 (using https://stackoverflow.com/questions/5596982/using-powershell-to-write-a-file-in-utf-8-without-the-bom ) but nothing i tried gets rid of BOM on powershell saved file even though input file has No BOM. – htfree May 08 '19 at 22:08
  • @Noodles I did find this interesting "PS does add a BOM to outputted files (we need to investigate exactly where)" https://github.com/PowerShell/PowerShell/issues/707 – htfree May 08 '19 at 23:11
  • You have lots of ways to read/write files in Windows. You have API calls, you have COM shell calls, COM libraries (ie FSO object and ADODB streams), and the .NET framework. PS uses the .NET framework but can use any of the others. The command prompt is based on 8 bit OEM fonts so batch files will always work correctly. Windows automatically translates. API functions come ANSI AND UTF16. Send a string from a UTF16 program to an ANSI program it gets converted to ANSI automatically. Everything will end up using file api calls. **Power Shell can use all of these methods.** – Noodles May 08 '19 at 23:27
  • Also text file functions are different to other file functions. When you use a textfile functions you are telling windows to use locale and/or encodings for interpreting the file. – Noodles May 08 '19 at 23:39
  • Well all I wanted is a powershell expert to help me make a PS perl equivalent simple regex on an ucs-2le (no-bom) encoded file and output the result in same format. As I was saying in my comments to the proposed answer, I'd like someone show how to do it without ending up with ucs-2le file "with" a BOM. – htfree May 09 '19 at 03:39
  • As for asking about downvotes; no, you can't tell who downvoted or why, voting is anonymous; and this is a very common FAQ. See e.g. https://meta.stackexchange.com/questions/63895/who-has-downvoted-me-and-why – tripleee May 11 '19 at 15:14
  • And not my downvote, but I can see several problems with your question, including the lack of any attempt to write the requested code yourself, and the imprecise and vaguely accusatory allegations about unspecified bugs. – tripleee May 11 '19 at 15:16
  • @tripleee I agree the question may not have been phrased perfectly, I already replied to NOODLES in regards to that if you read our discussion above. But sorry your laziness accusation is "way off base", I've worked many hours on this trying to find a solution and tried a gazillion things, if you read my replies to LIT to his answer you can see I came up with something that WORKS but problem is it creates output file with a BOM while original input file is without. See my first reply to LIT's answer and you'll see what I came up with after lots of trial/error... – htfree May 11 '19 at 21:58
  • I'm not accusing, just observing. If your question doesn't actually mention these things *in the question itself*, perhaps you should [edit] it so it does. Of course, the only way to demonstrate that you tried "a gazillion things" is to show at least some of the things you tried. Remember, your question should remain self-contained. Anyway, this may be moot now for this particular question, but I hope it can offer helpful guidance for your future use of this site. – tripleee May 12 '19 at 07:58
  • @tripleee Well i was a bit too busy trying even "more" gazillion things as I posted question and after too (so didn't spend much time recounting all I tried, lol when you do, others blame you for too long a question) and was "Exhausted" in the process as can be glimpsed from my efforts and discussion in the Related PERL question I Linked in this question. But also i had thought this request would be dead simple for a PS expert and wouldn't be "a lot" to ask for...I don't really want to make question longer now to clutter it, better clean, too many comments already! :) Glad all's good now,tx! – htfree May 12 '19 at 08:38
  • You were asking what caused someone to downvote your question and I offered some suggestions for how to improve it to prevent it from attracting more downvotes. Of course, those reasons are still present until you actually remove them; perhaps it would be wise to improve the quality still, if only as an exercise for posting better questions in the future (and of course, demonstrate responsible site membership). – tripleee May 12 '19 at 08:46
  • @tripleee I agree and thanks for the feedback but I also think people should use a little restraint before assuming the worst of people and give a little "benefit of the doubt" before judging, right now they don't even have to ask if I tried at all since they can read all these comments themselves (here and on the linked question). I'm concerned about making the question "cluttered" with a bunch of things I tried that did 'not work' properly for various reasons. Thanks and in future I'll do different but now that we've discussed and Clarified All in comments, prefer keep question clean :) thx! – htfree May 12 '19 at 16:35
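
The "look at the bytes" advice from the comments above can be sketched roughly as follows. This is only an illustration: the helper name `Test-Utf16LeBom` and the sample file name are made up for this sketch, and it only checks for the UTF-16LE BOM (bytes `FF FE`), not for other encodings.

```powershell
# Hypothetical helper (not a built-in cmdlet): returns $true when the file
# starts with the UTF-16LE byte order mark FF FE.
function Test-Utf16LeBom {
    param([string]$Path)
    # Reads the whole file, which is fine for a small sample
    $bytes = [System.IO.File]::ReadAllBytes($Path)
    return (($bytes.Length -ge 2) -and ($bytes[0] -eq 0xFF) -and ($bytes[1] -eq 0xFE))
}

# Hypothetical sample file, written once with and once without a BOM
$samplePath = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-check.txt'

# UnicodeEncoding(bigEndian, byteOrderMark)
$withBom = New-Object System.Text.UnicodeEncoding $False, $True
$noBom   = New-Object System.Text.UnicodeEncoding $False, $False

[System.IO.File]::WriteAllText($samplePath, 'foo', $withBom)
$resultWithBom = Test-Utf16LeBom -Path $samplePath

[System.IO.File]::WriteAllText($samplePath, 'foo', $noBom)
$resultNoBom = Test-Utf16LeBom -Path $samplePath
```

Running the check before and after an edit shows whether a tool quietly added a BOM to a file that did not have one.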

1 Answer


This writes the output file after applying the regex replacement. The output file does not begin with a BOM. This should work for small files; large files may need changes to run at a reasonable speed.

$fin = 'C:/src/t/revbom-in.txt'
$fout = 'C:/src/t/revbom-out.txt'
if (Test-Path -Path $fout) { Remove-Item -Path $fout }

# Create a file for input
# UnicodeEncoding(bigEndian, byteOrderMark): $False, $False means UTF-16LE without a BOM
$UCS2LENoBomEncoding = New-Object System.Text.UnicodeEncoding $False, $False
[System.IO.File]::WriteAllLines($fin, "now is the time`r`nwhen was the time", $UCS2LENoBomEncoding)

# Read the file in, replace string, write file out
[System.IO.File]::ReadLines($fin, $UCS2LENoBomEncoding) |
    ForEach-Object {
        [System.IO.File]::AppendAllLines($fout, [string[]]($_ -replace 'the','a'), $UCS2LENoBomEncoding)
    }
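
For larger files, one possible change is to collect the replaced lines first and write them with a single `WriteAllLines` call instead of appending to the file once per line. The following is only a sketch under assumptions: the temp-directory paths are hypothetical, and I have not measured the speed difference. Note that `WriteAllLines` uses the platform's line terminator, which is CR LF on Windows.

```powershell
# Hypothetical paths, chosen only for this sketch
$tmp  = [System.IO.Path]::GetTempPath()
$fin  = Join-Path $tmp 'ucs2le-in.txt'
$fout = Join-Path $tmp 'ucs2le-out.txt'

# UnicodeEncoding(bigEndian, byteOrderMark): UTF-16LE without a BOM
$enc = New-Object System.Text.UnicodeEncoding $False, $False

# Sample input so the sketch is self-contained
[System.IO.File]::WriteAllText($fin, "now is the time`r`nwhen was the time`r`n", $enc)

# Apply the regex to every line, then write the result in one call
$lines = [string[]]@(
    [System.IO.File]::ReadLines($fin, $enc) |
        ForEach-Object { $_ -replace 'the', 'a' }
)
[System.IO.File]::WriteAllLines($fout, $lines, $enc)
```

This holds all lines in memory at once, so it trades memory for fewer file opens.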

HT: @refactorsaurusrex at https://gist.github.com/refactorsaurusrex/9aa6b72f3519dbc71f7d0497df00eeb1 for the [string[]] cast

NB: mklement0 at https://gist.github.com/mklement0/acb868a9f15d9a34b6e88fc874b3851d

NB: If the source file is HTML, please see https://stackoverflow.com/a/1732454/447901

lit
  • I had errors with your code for some reason from the win7 cmd line. But I seem to have gotten it to work ok with this: powershell -Command "(gc myWinUCS-2LEfile.txt -Encoding Unicode) | ForEach-Object { $_.replace('foo', 'bar')} | Out-File -encoding Unicode Regexed_usc-2le_utf-16-le.txt" – htfree May 08 '19 at 03:12
  • I think the key is simply encoding to Unicode, previously i was trying to specify UCS-2LE or UTF-16LE which it didn't understand, it only knows ASCII or Unicode – htfree May 08 '19 at 03:13
  • See the `-Encoding` section from `help Get-Content -Full` regarding which encodings PowerShell understands. Here is a way to avoid writing a BOM. https://stackoverflow.com/a/5596984/447901 – lit May 08 '19 at 03:23
  • Yes I've modified my previous powershell I used and if I pass New-Object System.Text.UTF8Encoding $False and use [System.IO.File]::WriteAllLines I can then create a good UTF-8 NO BOM conversion of my UCS-2LE file. But how can I create a NO BOM version of my UCS2-LE file? I've heard said BOM's no issue with unicode and seems I can use the BOM ucs2-le file ok but still curious how to output UCS2-LE with No Bom. – htfree May 08 '19 at 06:01
  • How about trying `System.Text.UnicodeEncoding`. https://docs.microsoft.com/en-us/dotnet/api/system.text.unicodeencoding?view=netframework-4.8 – lit May 08 '19 at 13:02
  • i had tried it already but i can't recall if it didn't work right or gave me errors (perhaps a syntax error, who knows) as if it didn't like it, i'll retry it to confirm. – htfree May 08 '19 at 20:30
  • I would think that telling the constructor $false on BOM would work the same way as UTF-8. – lit May 08 '19 at 20:39
  • No it doesn't unfortunately. If I leave all the same but change System.Text.UTF8Encoding $False to System.Text.UnicodeEncoding $False , I get error New-Object : Cannot find an overload for "UnicodeEncoding" and the argumentcount: "1". Only if i change that line to [System.Text.Encoding]::UNICODE will it work, and it will give UCS-2LE but with BOM. – htfree May 08 '19 at 21:46
  • @LIT_LIT if you change your code to "save output to file" and it does regex on ucs2le and saves to ucs2le, I'll click to accept your answer, as long as if it adds a BOM like my version does also then acknowledge in the answer so later others may notice and possibly contribute a version that does not add a BOM to resulting regex modified ucs2le saved file. – htfree May 10 '19 at 02:03
  • @htfree - Please have another look. Is this what you wanted? – lit May 10 '19 at 13:51
  • @lit_lit The problem is that if I comment out your "now is the time" line and use my pre-existing UCS-2LE file (without BOM) which I need to edit, then your regex does not work and the output file is UCS-2LE but adds a BOM and has a NUL added after every character. I was meaning if you could add something at least as functional as the version i included in my first reply to you but to mention it is not perfect if it has the same flaw as mine, ie adding a BOM to output UCS-2LE file instead of leaving output file the "same" format as input file, "without BOM". – htfree May 11 '19 at 05:16
  • @htfree - Ok, let's try one more time. – lit May 11 '19 at 15:03
  • @Lit_Lit ok sure thanks, and if you can't find a solution that works without adding a BOM to the UCS-2LE, then just put a command that does similar to mine or you can even put exact one I used if you can't find anything better and I'll still give you credit as accepted solution since you tried the best to help me. Update-Oh I hadn't noticed you added new code, will test soon, tx! – htfree May 11 '19 at 22:03
  • @Lit_Lit Congrats! It seems to work, strange that when i had tried that method with the "$False" trick it worked for utf8 but not unicode, not sure what syntax i had done wrong. Anyway thanks for sticking with it and providing a working solution! – htfree May 11 '19 at 22:40
  • @Lit_Lit Awesome! Based on your code I was able to fix my own command that gave errors before by changing my "$False" to double false: "$False, $False" and now it works too! ie after adding the extra $False my command is now: powershell -Command "$myfiledata = ((gc UCS2-LEwinCRLFnobom.txt -Encoding Unicode) | ForEach-Object { $_.replace('foo', 'bar')}); $NoBom = New-Object System.Text.UnicodeEncoding $False, $False ; [System.IO.File]::WriteAllLines('UCS2-LEwinCRLFnobom.txt', $myfiledata, $NoBom)" – htfree May 11 '19 at 23:02
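
Putting the thread together, a rough in-place equivalent of `perl -pi.bak -e 's/search/replace/g;'` could look like the sketch below. The assumptions: the file is UTF-16LE, the temp-directory path and sample content are hypothetical, and the whole file fits in memory. It uses the regex operator `-replace` rather than the `.replace()` method from the commands above, which does a literal string replacement. Because the regex runs over the whole text rather than line by line, patterns with `^` or `$` anchors would behave differently than in Perl's `-p` loop.

```powershell
# Hypothetical file location and sample content for this sketch
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'MyUCS-2LEfile.txt'

# UnicodeEncoding(bigEndian, byteOrderMark): UTF-16LE without a BOM
$enc = New-Object System.Text.UnicodeEncoding $False, $False
[System.IO.File]::WriteAllText($path, "search one`r`nsearch two`r`n", $enc)

# Keep a backup copy, as perl's -i.bak would
Copy-Item -Path $path -Destination "$path.bak" -Force

# Read the whole file, apply the regex, write it back in the same encoding.
# Reading and writing the full text keeps the existing CR LF line endings.
$text = [System.IO.File]::ReadAllText($path, $enc)
$text = $text -replace 'search', 'replace'
[System.IO.File]::WriteAllText($path, $text, $enc)
```

Since the text is written back as-is apart from the replacement, no line terminator is added or removed.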