I'm racking my brain :D I am trying to encode a text file so that it is saved the same way Notepad saves it. The result looks exactly the same, but it isn't: only if I open the file in Notepad and save it again does it work for me. What could be the problem with the encoding, and how can I solve it? Is there a command that opens Notepad and saves the file again?

I currently use:

(Get-Content 000014.log) | Out-FileUtf8NoBom ddppyyyyy.txt

and after that:

Get-ChildItem ddppyyyyy.txt | ForEach-Object {
  # get the contents and replace line breaks by U+000A
  $contents = [IO.File]::ReadAllText($_) -replace "`r`n?", "`n"
  # create UTF-8 encoding without signature
  $utf8 = New-Object System.Text.UTF8Encoding $false
  # write the text back
  [IO.File]::WriteAllText($_, $contents, $utf8)
}
  • Which PowerShell version are you using? What is the actual problem? – wp78de Oct 25 '20 at 18:35
  • The summary seems to be: there is nothing wrong with the code in the question, the true problem was embedded `NUL` characters in the files, which caused problems in `R`, and which opening and resaving in Notepad _implicitly_ removed, thereby resolving the problem (assuming that simply _discarding_ these `NUL`s works as intended). – mklement0 Oct 26 '20 at 13:29
  • Allow me to give you the standard advice to newcomers in the next comment; I also encourage you to revisit your previous questions. – mklement0 Oct 26 '20 at 21:39
  • If you [accept](https://meta.stackexchange.com/a/5235/248777) an answer, you will help future readers by showing them what solved your problem. To accept an answer, click the large ✓ symbol below the large number to the left of the answer (you'll get 2 reputation points). If you have at least 15 reputation points, you can also up-vote other helpful answers (optionally also the accepted one). If your problem isn't solved yet, provide feedback, or, if you found the solution yourself, [self-answer](http://stackoverflow.com/help/self-answer). – mklement0 Oct 26 '20 at 21:39

2 Answers


When you open a file with notepad.exe, it autodetects the encoding (or do you open the file explicitly via File -> Open... as UTF-8?). If your file is actually not UTF-8 but something else, Notepad may be able to work around this and convert it to the required encoding when the file is resaved. So if you do not specify the correct input encoding in your PowerShell script, things will go wrong.

But that's not all: Notepad also drops erroneous characters when the file is saved, in order to create a regular text file. For instance, your text file might contain a NUL character that only gets removed when you use Notepad. If this is the case, it is highly unlikely that your input file is UTF-8 encoded (unless it is broken). So it looks like your problem is that your source file is UTF-16 or similar; try to find the right input encoding and rewrite the file, e.g. from UTF-16 to UTF-8:

Get-Content file.foo -Encoding Unicode | Set-Content -Encoding UTF8 newfile.foo
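Before guessing at the input encoding, it can help to look at the raw bytes. The following is only a minimal sketch: `Get-ByteSummary` is a hypothetical helper name, not part of the answer above, and the UTF-16 heuristic in its comment is a rough rule of thumb, not a reliable detector.

```powershell
# Hypothetical helper (not from the original answer): summarize the raw bytes of a file.
function Get-ByteSummary {
  param([Parameter(Mandatory)][string] $Path)

  # Read the file as bytes so that no text decoding takes place.
  $bytes = [System.IO.File]::ReadAllBytes($Path)

  # Applying -eq to an array filters it, so this counts the NUL (0x00) bytes.
  $nulCount = @($bytes -eq 0).Count

  [pscustomobject]@{
    Length   = $bytes.Length
    NulCount = $nulCount
    # Rule of thumb: in UTF-16 text that is mostly ASCII, about every second byte is NUL.
    # A few scattered NULs point to embedded NUL characters in a single-byte encoding instead.
    NulRatio = if ($bytes.Length) { $nulCount / $bytes.Length } else { 0 }
  }
}
```

Since .NET methods resolve relative paths against the process working directory rather than the PowerShell location, pass a full path, for example `Get-ByteSummary (Convert-Path 000014.log)`.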

Try it like this:

Get-ChildItem ddppyyyyy.txt | ForEach-Object {
  # get the contents, replace Windows line breaks by U+000A, and drop NUL characters
  $raw = (Get-Content -Raw $_.FullName -Encoding UTF8) -replace "`r?`n", "`n" -replace "`0", ""
  # create UTF-8 encoding without BOM signature
  $utf8NoBom = New-Object System.Text.UTF8Encoding $false
  # write the text back; WriteAllText, unlike WriteAllLines, appends no extra trailing newline
  [System.IO.File]::WriteAllText($_.FullName, $raw, $utf8NoBom)
}

If you are struggling with the byte-order mark (BOM), it is best to use a hex editor to check the file header manually. After saving your file as shown above, then opening it in Notepad and saving it under a new name, the two files no longer show any difference:

[Image: hex comparison of the two files]

The hex-dumped beginning of a file with BOM looks like this instead:

[Image: hex dump of the beginning of a file with a BOM]
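If no hex editor is at hand, the same header check can be scripted. This is a minimal sketch: `Test-Utf8Bom` is a hypothetical helper name, and it only checks for the three-byte UTF-8 signature EF BB BF, not for UTF-16 or UTF-32 BOMs.

```powershell
# Hypothetical helper: returns $true if the file starts with the UTF-8 BOM (EF BB BF).
function Test-Utf8Bom {
  param([Parameter(Mandatory)][string] $Path)

  $stream = [System.IO.File]::OpenRead($Path)
  try {
    # Read at most the first three bytes of the file.
    $head = New-Object byte[] 3
    $read = $stream.Read($head, 0, 3)
  } finally {
    $stream.Dispose()
  }

  return ($read -eq 3 -and $head[0] -eq 0xEF -and $head[1] -eq 0xBB -and $head[2] -eq 0xBF)
}
```

As with other .NET file calls, a full path is the safe input, for example `Test-Utf8Bom (Convert-Path ddppyyyyy.txt)`.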

Also, as noted, while your regex pattern should work, if you want to convert Windows newlines to Unix style it is much more common and safer to make the CR optional: `r?`n

As noted by mklement0, reading the file using the correct encoding is important; if your file is actually in Latin1 or something similar, you will end up with a broken file if you carelessly convert it to UTF-8 in PowerShell. Thus, I have added the -Encoding UTF8 parameter to the Get-Content cmdlet; adjust as needed.
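To make the difference between the two patterns concrete, here is a small sketch on a made-up sample string that mixes a CRLF, an LF-only, and a lone CR newline:

```powershell
# Made-up sample: "a" CRLF "b" LF "c" CR "d"
$sample = "a`r`nb`nc`rd"

# CR optional, LF required: CRLF and LF-only both become LF; the lone CR is left alone.
$crOptional = $sample -replace "`r?`n", "`n"

# CR required, LF optional (the pattern from the question): CRLF and lone CR become LF;
# an LF-only newline is not matched, but it is already LF, so it stays as it is.
$lfOptional = $sample -replace "`r`n?", "`n"
```

For input that contains only CRLF newlines, both patterns give the same result; they differ only when isolated CR characters are present.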

wp78de
  • I cannot figure out what the problem is. The encoding looks exactly the same, but only after I save the file again in Notepad can I use it the way I want. My script does not give me the same result as Notepad's UTF-8 save. The problem is that I want to do this automatically; I cannot open Notepad manually and save every time. – ranonfran ranranonf Oct 25 '20 at 17:50
  • I think the problem is not the encoding at all. When I take the original file, open it in Notepad, and save it without changing the encoding or anything else, I can use it. What does Notepad change in the file? How can I check? – ranonfran ranranonf Oct 25 '20 at 20:13
  • We still don't know what's wrong with your file if you do not open it with Notepad. Do you see broken characters, or what is it? – wp78de Oct 25 '20 at 20:24
  • I want to use the file in R, but if I do not first open it in Notepad and save it, I get this error in R: Error in stri_encode(txt, encoding, "UTF-8") : embedded nul in string: '\021' In addition: There were 50 or more warnings (use warnings() to see the first 50). If I just open it in Notepad and save without changing the encoding or anything, it does work. – ranonfran ranranonf Oct 25 '20 at 23:27
  • @ranonfranranranonf Aha, maybe your strings contain NUL chars. Try to filter those as well; append after the first replace: `-replace "`0", ""`. Maybe it's a different control character, but I am pretty sure this is the actual issue. – wp78de Oct 25 '20 at 23:52
  • Hi, I found something that helps me read the text, but is it possible to do it without destroying the order of the lines? (Get-Content ssasaas11.log -Raw) -replace '[^ -~\t]' | Set-Content dssdpppp1111OUT.txt -Force – ranonfran ranranonf Oct 26 '20 at 00:07
  • If your file is actually UTF-16, try converting it like this: `Get-Content file.foo -Encoding Unicode | Set-Content -Encoding UTF8 newfile.foo` – wp78de Oct 26 '20 at 00:48
  • Your hex dump looks almost like binary. Is this UTF-32? Where is it from? – wp78de Oct 26 '20 at 01:05
  • @wp78de, it looks like you were right about `NUL`s, though they seem to be embedded in an otherwise single-byte encoding file, judging by a comment on my answer; in that case, your `-replace "`0"` recommendation should fix the problem. – mklement0 Oct 26 '20 at 13:27

Update: There is nothing wrong with the code in the question, the true problem was embedded NUL characters in the files, which caused problems in R, and which opening and resaving in Notepad implicitly removed, thereby resolving the problem (assuming that simply discarding these NULs works as intended) - see also: wp78de's answer.

Therefore, modifying the $contents = ... line as follows should fix your problem:

$contents = [IO.File]::ReadAllText($_) -replace "`r`n", "`n" -replace "`0"
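A small sketch of what the extra -replace does, on a made-up string. The second variant is an assumption, not something the answers recommend: the hex dump posted in a comment on this answer shows 20 bytes where the original file had 00 bytes, which suggests that Notepad wrote spaces in place of the NULs. If the consumer of the file depends on field widths or offsets, substituting a space may be closer to Notepad's output than removal.

```powershell
# Made-up sample: "AB", two NUL characters, then "CD".
$text = "AB`0`0CD"

# Remove the NULs entirely, as suggested in the answers.
$removed = $text -replace "`0"

# Alternative (assumption based on the hex dump): replace each NUL with a space,
# which keeps the character count unchanged.
$spaced = $text -replace "`0", ' '
```

Either variant leaves no NUL characters in the text, which is what the R error message complains about.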

Note: The code in the question uses the Out-FileUtf8NoBom function from this answer, which allows saving to BOM-less UTF-8 files in Windows PowerShell; it now supports a -UseLF switch, which would simplify the OP's command to (additional problems notwithstanding):

Get-Content 000014.log | Out-FileUtf8NoBom ddppyyyyy.txt -UseLF

  • There's a conceptual flaw in your regex, though it is benign in this case: instead of "`r`n?" you want "`r?`n" (or, expressed as a pure regex, '\r?\n') in order to match both CRLF ("`r`n") and LF-only ("`n") newlines.

    • Your regex would instead match CRLF and CR-only(!) newlines; however, as wp78de points out, if your input file contains only the usual CRLF newlines (and not also isolated CR characters), your replacement operation should still work.

    • In fact, you don't need a regex at all if all you need is to replace CRLF sequences with LF: -replace "`r`n", "`n"

  • Assuming that your original input files are ANSI-encoded, you can simplify your approach as follows, without the need to call Out-FileUtf8NoBom first (assumes Windows PowerShell):

# NO need for Out-FileUtf8NoBom - process the ANSI-encoded files directly.
Get-ChildItem *SomePattern*.txt | ForEach-Object {
  # Get the contents and make sure newlines are LF-only
  # [Text.Encoding]::Default is the encoding for the active ANSI code page
  # in Windows PowerShell.
  $contents = [IO.File]::ReadAllText(
    $_.FullName, 
    [Text.Encoding]::Default
  ) -replace "`r`n", "`n"
  # Write the text back with BOM-less UTF-8 (.NET's default when no encoding is passed)
  [IO.File]::WriteAllText($_.FullName, $contents)
}

Note that replacing the content of files in-place bears a risk of data loss, so it's best to create backup copies of the original files first.
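One way to follow that advice is sketched below. `Convert-FileWithBackup` and the `.bak` suffix are hypothetical choices for illustration. The sketch also assumes the input is UTF-8 or plain ASCII, because `ReadAllText` without an encoding argument decodes as UTF-8 unless it finds a BOM; for ANSI-encoded input, pass the encoding as in the code above.

```powershell
# Hypothetical helper: copy a file to "<name>.bak" before rewriting it in place.
function Convert-FileWithBackup {
  param([Parameter(Mandatory)][string] $Path)

  # Resolve to a full path, because .NET methods ignore the PowerShell location.
  $fullPath = Convert-Path -LiteralPath $Path

  # Keep a copy of the original (an existing .bak file is overwritten).
  Copy-Item -LiteralPath $fullPath -Destination "$fullPath.bak"

  # Normalize CRLF to LF and drop NUL characters.
  $contents = [System.IO.File]::ReadAllText($fullPath) -replace "`r`n", "`n" -replace "`0"

  # Without an encoding argument, WriteAllText writes BOM-less UTF-8.
  [System.IO.File]::WriteAllText($fullPath, $contents)
}
```

If the rewritten file turns out to be wrong, the original can be restored by copying the `.bak` file back over it.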


Note: If you wanted to perform the same operation in PowerShell [Core] v6+, which is built on .NET Core, the code must be modified slightly, because [Text.Encoding]::Default no longer reflects the active ANSI code page and instead invariably returns a BOM-less UTF-8 encoding.

Therefore, the $contents = ... statement would have to change to (note that this would work in Windows PowerShell too):

  $contents = [IO.File]::ReadAllText(
    $_.FullName,
    [Text.Encoding]::GetEncoding(
      [cultureinfo]::CurrentCulture.TextInfo.AnsiCodePage
    )
  ) -replace "`r`n", "`n"
mklement0
  • This is the original file, which doesn't work (Format-Hex output):
    00000000 B1 98 D0 11 A9 00 01 91 A6 00 00 00 00 00 00 02 ±Ð.©..¦.......
    00000010 00 00 00 00 4A 44 41 54 41 3A 68 74 74 70 73 5F ....JDATA:https_
    And this is the file after saving it in Notepad, which works:
    00000000 B1 98 D0 11 A9 20 01 91 A6 20 20 20 20 20 20 02 ±Ð.© .¦ .
    00000010 20 20 20 20 4A 44 41 54 41 3A 68 74 74 70 73 5F JDATA:https_
    But I don't know how to get the same result in PowerShell. – ranonfran ranranonf Oct 26 '20 at 00:29
  • @ranonfranranranonf: It looks like wp78de was correct: your file contains `NUL` characters (`"`0"`), and if it's sufficient to _remove_ them (which Notepad apparently implicitly does), simply apply another `-replace` operation: `-replace "`0"` – mklement0 Oct 26 '20 at 13:25