Windows 10 64-bit. PowerShell 5.1 and 7rc.1
Use PowerShell to convert Microsoft Word documents to HTML 4 / 5 documents.
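HTML 4 and 5 documents should be saved using the UTF-8 character encoding. Windows PowerShell 5.1 does not write BOM-less UTF-8 by default: Out-File and > write UTF-16LE, Set-Content writes the system ANSI code page, and -Encoding UTF8 adds a BOM (UTF-8-BOM). The declaration <meta http-equiv=Content-Type content="text/html; charset=windows-1252"> has nothing to do with the character encoding the document is actually saved in; it only tells the browser what to assume.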
You have at least three jobs:
- Replace charset=windows-1252 with charset=UTF-8.
- Save your documents using UTF-8 character encoding format.
- Check your output for errors.
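The third job, checking the output, can be partly automated. This is a minimal sketch and not part of any conversion script: the function name Test-HtmlUtf8 and its property names are made up for illustration. It reads the raw bytes and reports whether the file starts with a UTF-8 BOM, whether the bytes are valid UTF-8, and whether a windows-1252 declaration is still present.

```powershell
# Hypothetical helper; the function and property names are illustrative assumptions.
function Test-HtmlUtf8 {
    param([Parameter(Mandatory)][string]$Path)

    # Resolve to a full path, because .NET file APIs do not follow
    # PowerShell's current location for relative paths.
    $full  = (Resolve-Path -LiteralPath $Path).ProviderPath

    # Read raw bytes so PowerShell's default text decoding cannot hide problems.
    $bytes = [IO.File]::ReadAllBytes($full)

    # A UTF-8 BOM is the three bytes EF BB BF at the start of the file.
    $hasBom = $bytes.Length -ge 3 -and
              $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF

    # UTF8Encoding(emit BOM = $false, throw on invalid bytes = $true)
    # raises an error for byte sequences that are not valid UTF-8.
    $strict    = [Text.UTF8Encoding]::new($false, $true)
    $validUtf8 = $true
    try { [void]$strict.GetString($bytes) } catch { $validUtf8 = $false }

    # The declaration is plain ASCII, so an ASCII decode is enough to find it.
    $ascii = [Text.Encoding]::ASCII.GetString($bytes)

    [pscustomobject]@{
        HasBom            = $hasBom
        IsValidUtf8       = $validUtf8
        StillDeclares1252 = $ascii -match 'charset=windows-1252'
    }
}
```

Usage: Test-HtmlUtf8 -Path "$env:userprofile\Desktop\output.html". For a correctly converted file I would expect HasBom to be False, IsValidUtf8 to be True and StillDeclares1252 to be False.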
Use your conversion script of choice. I like Thomas Stensitzki's Convert-WordDocument.ps1 for converting Word documents with PowerShell. Like most conversion scripts, it requires Apache OpenOffice (~v4.1.7) or Microsoft Word (~12? Thomas says Word 16) to be installed locally. It converts a 5 MB Word 2003 document with 16 images to HTML in under twelve seconds.
Change your http-equiv meta element if necessary. For HTML 4 documents, change
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
to
<meta http-equiv=Content-Type content="text/html; charset=UTF-8">
For HTML 5 documents, change
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
to
<meta charset="UTF-8">
A sitemap I created on 01/24/20 at xml-sitemaps.com used both:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
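For the HTML 5 form the whole meta element has to be swapped, not only the charset value. A rough sketch using -replace with a regular expression follows. The pattern is my assumption about what the converter emits: it only covers a single-line http-equiv element like the one shown above, with or without quotes around Content-Type.

```powershell
# Sample input line, copied from the meta element shown above.
$line = '<meta http-equiv=Content-Type content="text/html; charset=windows-1252">'

# Assumed pattern: <meta, the http-equiv Content-Type attribute (quotes optional),
# then any other attributes up to the closing >. -replace is case-insensitive.
$pattern = '<meta\s+http-equiv="?Content-Type"?[^>]*>'

# Prints: <meta charset="UTF-8">
$line -replace $pattern, '<meta charset="UTF-8">'
```

The same -replace works on the array of lines returned by Get-Content, since the operator is applied to each line in turn. It will not match a meta element that is split across several lines.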
Save / Create the document using the UTF-8 character encoding format.
What works in PowerShell 5.1 might be easier in PowerShell 6 or later, because later versions of PowerShell default to the UTF-8 character encoding (without a BOM). Read the links below.
PowerShell 5.1:
# Write to a new file; the source is not overwritten.
# [IO.File]::WriteAllLines writes UTF-8 without a BOM by default.
$source = (Get-Content $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8"
$output = "$env:userprofile\Desktop\output.html"
[IO.File]::WriteAllLines($output, $source)
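PowerShell 7rc.1:
# Note: in PowerShell 6 and later, Get-Content reads files without a BOM as UTF-8.
# If the source really is saved as windows-1252, non-ASCII characters may need -Encoding 1252 on Get-Content.
# Write to a new file; the source is not overwritten. Out-File writes UTF-8 without a BOM here.
(Get-Content $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8" | Out-File -Force $env:userprofile\Desktop\output.html
# Overwrite the source file. The parentheses make sure the file is read completely before it is written.
(Get-Content $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8" | Out-File -Force $env:userprofile\Desktop\source.html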
Batch convert with PowerShell 7rc.1:
# Overwrite every .html file on the Desktop. UTF-8 character encoding format.
foreach ($i in Get-ChildItem -Name "$env:userprofile\Desktop\*.html")
{
    (Get-Content "$env:userprofile\Desktop\$i") -replace "charset=windows-1252", "charset=UTF-8" | Out-File -Force "$env:userprofile\Desktop\$i"
}
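A variation that does not rely on either version's default encodings is sketched below. It assumes the input files really are windows-1252 encoded, as their meta element claims, so check a sample first. The function name Convert-HtmlToUtf8 is hypothetical. It decodes each file as code page 1252, fixes the declaration, and writes UTF-8 without a BOM through .NET. As far as I know [Text.Encoding]::GetEncoding(1252) resolves in both 5.1 and 7; if it throws on your system, that code page is not available there.

```powershell
# Hypothetical helper; assumes every matching file is windows-1252 encoded.
function Convert-HtmlToUtf8 {
    param([Parameter(Mandatory)][string]$Folder)

    $cp1252    = [Text.Encoding]::GetEncoding(1252)   # source encoding (assumption)
    $utf8NoBom = [Text.UTF8Encoding]::new($false)     # $false = do not emit a BOM

    foreach ($file in Get-ChildItem -LiteralPath $Folder -Filter *.html -File) {
        $text = [IO.File]::ReadAllText($file.FullName, $cp1252)

        # Skip files without the old declaration, so a second run does not
        # decode already-converted UTF-8 files as windows-1252 again.
        if ($text -notmatch 'charset=windows-1252') { continue }

        $text = $text -replace 'charset=windows-1252', 'charset=UTF-8'
        [IO.File]::WriteAllText($file.FullName, $text, $utf8NoBom)
    }
}
```

Usage: Convert-HtmlToUtf8 -Folder "$env:userprofile\Desktop". Like the loop above, this overwrites the files, so keep a copy of the originals.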
That should display your special characters correctly.
Understanding file encoding
HTML Charset - W3Schools
Declaring character encodings in HTML
HTML http-equiv Attribute
Using PowerShell to write a file in UTF-8 without the BOM
Understanding file encoding2
Understand default encoding and change the same in PowerShell
To check which version of PowerShell you have, run $PSVersionTable.PSVersion
Always declare the encoding of your document using a meta element with a charset attribute. The declaration should fit completely within the first 1024 bytes of the file, so it's best to put it immediately after the opening head tag. To see the first 1024 bytes of an .html file in Windows 10 64-bit, download http://unxutils.sourceforge.net/UnxUpdates.zip and use head -c 1024 myfilenamehere.html
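The same check can be done without extra downloads. The following is a sketch only: the function name Test-CharsetInFirst1024 is made up, and the regular expression is a loose assumption about how a charset declaration looks. It reads at most 1024 bytes through .NET, so it behaves the same in PowerShell 5.1 and 7, and returns True if a charset declaration appears in that range.

```powershell
# Hypothetical helper: is there a charset declaration in the first 1024 bytes?
function Test-CharsetInFirst1024 {
    param([Parameter(Mandatory)][string]$Path)

    # Resolve to a full path, because .NET file APIs do not follow
    # PowerShell's current location for relative paths.
    $full   = (Resolve-Path -LiteralPath $Path).ProviderPath
    $buffer = New-Object byte[] 1024

    $stream = [IO.File]::OpenRead($full)
    try     { $count = $stream.Read($buffer, 0, $buffer.Length) }
    finally { $stream.Dispose() }

    # The declaration itself is plain ASCII, so an ASCII decode is enough here.
    $head = [Text.Encoding]::ASCII.GetString($buffer, 0, $count)

    # Loose pattern: charset = optional quote, then an encoding name.
    $head -match 'charset\s*=\s*["'']?[\w-]+'
}
```

Usage: Test-CharsetInFirst1024 -Path "$env:userprofile\Desktop\output.html". A result of False means the declaration is missing or sits too far down in the file.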
None of the following worked for me, but they are worth reading.
Changing PowerShell's default output encoding to UTF-8
Changing source files encoding and some fun with PowerShell
Convert Word documents using PowerShell
How to convert a word document to other formats using PowerShell
Saving Word document as HTML
Convert word document to text file using powershell