I need to merge all txt-files in a certain folder on my computer. There are hundreds of them and they all have different names, so any code that requires typing the file names manually does not work for me. The files are UTF-8 encoded and contain emojis and characters from different scripts (such as Cyrillic) as well as accented characters (e.g. é, ü, à). A fellow Stack Overflow user was kind enough to give me the following code to run in PowerShell:

(gc *.txt) | out-file newfile.txt -encoding utf8

It works wonderfully for merging the files. However, it gives me a txt-file in "UTF-8 with BOM" encoding instead of plain "UTF-8". Furthermore, all emojis and special characters have been replaced by others, such as "Ã¼" instead of "ü". It is very important for what I am doing that these emojis and special characters remain.

Could someone help me with tweaking this code (or suggesting a different one) so it gives me a merged txt-file with "UTF-8"-encoding that still contains all of the special characters? Please keep in mind that I am a layperson.

Thank you so much in advance for your help and kind regards!

LDG
  • Have you tried [`UTF8NoBOM`](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/out-file?view=powershell-6)? `Get-Content` also supports encoding specification, which the sample doesn't utilize. – vonPryz Nov 08 '19 at 09:56
  • @vonPryz Firstly, thank you for reacting! I tried it out, but ```(gc *.txt) | out-file newfile.txt -encoding UTF8NoBOM``` only gives me an error that: ```Out-File: Cannot validate argument on parameter 'Encoding'. The argument "UTF8NoBOM" does not belong to the set "unknown;string;unicode;bigendianunicode;utf8;utf7;utf32;ascii;default;oem" specified by the ValidateSet attribute. Supply an argument that is in the set and then try the command again.``` – LDG Nov 08 '19 at 10:03
  • The NoBOM requires Powershell 6; you've got older version. Anyway, does it help if you specify UTF8 to `Get-Content`? Try also a [work-around](https://stackoverflow.com/q/5596982/503046) via .Net. – vonPryz Nov 08 '19 at 10:12
  • @vonPryz Oh, that explains my problem at least partially. The code I used was ```(gc *.txt) | out-file newfile.txt -encoding UTF8```. If that is what you mean then unfortunately it didn't work. It always gives me a txt-file with "UTF-8 with BOM". I looked at the work-around (thank you!) you sent me, but there's a lot of information there and I'm not really sure what to use. – LDG Nov 08 '19 at 10:21
  • For PS 5 you at least need `(gc *.txt -encoding utf8)` if the input files are UTF-8 without a BOM. But PS 5 can't save as UTF-8 without a BOM (maybe via .NET?). – js2010 Nov 08 '19 at 13:13
  • @js2010 Hi! I tried to do what you said and entered ```(gc *.txt) -encoding utf8 | out-file newfile.txt -encoding utf8```, but that didn't work at all. I got the following error: ```Unexpected token '-encoding' in expression or statement.``` and ```Unexpected token 'utf8' in expression or statement.``` Could you tell me why all of my special characters (emojis, accents, umlaut, cyrillic...) in the separate txt-files are changed when the files get merged into one single txt-file? That's my biggest problem. Is there something I can do about that? – LDG Nov 08 '19 at 13:33
  • @js2010 Thank you for your help! Please understand that this is the first time I've worked with Powershell - I have never learned or been taught how to use it, so I'm doing the best I can with what I have. – LDG Nov 08 '19 at 14:40
  • Ok. In this case, PS 6 or 7 would be more straightforward. – js2010 Nov 08 '19 at 15:23
  • @js2010 Sorry I didn't reply sooner! IT WORKED! I am so happy!!!!!! You solved my problem! Thank you so much! I used ```(gc *.txt -encoding utf8) | out-file newfile.txt -encoding utf8```, like you advised and it worked! All of the emojis and so on remained! Thank you so much for helping me!!!! – LDG Nov 13 '19 at 07:55

2 Answers

In PowerShell < 6.0, the Out-File cmdlet does not have a Utf8NoBOM encoding.
You can however write Utf8 text files without BOM using .NET:

Common for all methods below

$rootFolder = 'D:\test'  # the path where the textfiles to merge can be found
$outFile    = Join-Path -Path $rootFolder -ChildPath 'newfile.txt'
# All methods append to $outFile, so delete an existing newfile.txt before running one again.

Method 1

# create a Utf8NoBOM encoding object
$utf8NoBom = New-Object System.Text.UTF8Encoding $false  # $false means NoBOM
Get-Content -Path "$rootFolder\*.txt" -Encoding UTF8 -Raw | ForEach-Object {
    [System.IO.File]::AppendAllText($outFile, $_, $utf8NoBom)
}

Method 2

# create a Utf8NoBOM encoding object
$utf8NoBom = New-Object System.Text.UTF8Encoding $false  # $false means NoBOM
Get-ChildItem -Path $rootFolder -Filter '*.txt' -File |
    Where-Object { $_.FullName -ne $outFile } |   # skip the output file, it also matches *.txt
    ForEach-Object {
        [System.IO.File]::AppendAllLines($outFile, [string[]]($_ | Get-Content -Encoding UTF8), $utf8NoBom)
    }

Method 3

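# Create a StreamWriter object which by default writes Utf8 without a BOM.
$sw = New-Object System.IO.StreamWriter $outFile, $true  # $true is for Append
Get-ChildItem -Path $rootFolder -Filter '*.txt' -File |
    Where-Object { $_.FullName -ne $outFile } |   # skip the output file, which the writer has just created
    ForEach-Object {
        # -LiteralPath so file names containing [ or ] are not treated as wildcards
        Get-Content -LiteralPath $_.FullName -Encoding UTF8 | ForEach-Object {
            $sw.WriteLine($_)
        }
    }
$sw.Dispose()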
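
Whichever method above you pick, you can check the result afterwards. A UTF-8 BOM is the three bytes EF BB BF at the very start of a file, so a small helper can look for them. This is only a sketch: `Test-Utf8Bom` is a hypothetical name chosen for illustration, not a built-in cmdlet, and the demo uses two throwaway temporary files so it can be run as-is.

```powershell
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string]$Path)

    # .NET resolves relative paths against the process directory, so resolve the path first
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    # Read at most the first three bytes of the file
    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        $head = New-Object byte[] 3
        $read = $stream.Read($head, 0, 3)
    }
    finally {
        $stream.Dispose()
    }

    # The UTF-8 BOM is EF BB BF
    return ($read -eq 3 -and $head[0] -eq 0xEF -and $head[1] -eq 0xBB -and $head[2] -eq 0xBF)
}

# Demo on two throwaway files (hypothetical test data, not the files from the question)
$text    = 'caf' + [char]0x00E9   # "café", built from a code point
$withBom = [System.IO.Path]::GetTempFileName()
$noBom   = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllText($withBom, $text, (New-Object System.Text.UTF8Encoding $true))
[System.IO.File]::WriteAllText($noBom,   $text, (New-Object System.Text.UTF8Encoding $false))

$bomResult   = Test-Utf8Bom -Path $withBom   # file written with a BOM
$noBomResult = Test-Utf8Bom -Path $noBom     # file written without a BOM
Remove-Item -LiteralPath $withBom, $noBom
```

After merging, `Test-Utf8Bom -Path 'D:\test\newfile.txt'` is expected to return `False` if the file was written with the BOM-less encoding used above.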
Theo

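In PS 5, Get-Content (gc) reads files without a BOM using the system's legacy code page, so UTF-8 input without a BOM gets garbled unless you pass the -Encoding parameter. Note that Out-File -Encoding utf8 still writes a BOM in this version: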

(gc -Encoding Utf8 *.txt) | out-file newfile.txt -encoding utf8
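
To see why the -Encoding parameter matters, the following sketch reproduces the garbling by hand. It assumes the legacy code page is Windows-1252, which is common on Western European and US systems but may differ on yours. The character is built from its code point so the snippet does not depend on how the script file itself is saved.

```powershell
# 'ü' is U+00FC; build it from its code point
$original = [string][char]0x00FC

# UTF-8 stores 'ü' as two bytes: C3 BC
$bytes = [System.Text.Encoding]::UTF8.GetBytes($original)

# Assumption: the legacy "ANSI" code page is Windows-1252
$ansi = [System.Text.Encoding]::GetEncoding(1252)

# Decoding the UTF-8 bytes as Windows-1252 turns each byte into its own character
$garbled = $ansi.GetString($bytes)

# Decoding the same bytes as UTF-8 gives the original character back
$correct = [System.Text.Encoding]::UTF8.GetString($bytes)

$hex = ($bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
"UTF-8 bytes:          $hex"
"Read as Windows-1252: $garbled"
"Read as UTF-8:        $correct"
```

The bytes on disk are the same in both cases. Only the decoding step differs, which is why telling Get-Content the right encoding is enough to keep the characters intact.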
js2010