
Using the following PowerShell script, I am converting a directory of Word documents to HTML.

$wdTypes = Add-Type -AssemblyName 'Microsoft.Office.Interop.Word' -Passthru
[void][System.Reflection.Assembly]::LoadWithPartialName('Microsoft.Office.Interop.Word.WdSaveFormat')
$docSrc = "C:\Users\Me\Desktop\TestWordDocs"
$htmlOutputPath = "C:\Users\Me\Desktop\TestHTMLDocs"
$srcFiles = Get-ChildItem $docSrc -filter "*.doc"
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatHTML");
$wordApp = new-object -comobject word.application
$wordApp.Visible = $false

function saveashtml {
  $openDoc = $wordApp.documents.open($doc.FullName);
  $openDoc.saveas([ref]"$htmlOutputPath\$doc.fullname.html", [ref]$saveFormat);
  $openDoc.close();
}

ForEach ($doc in $srcFiles) {
  Write-Host "Converting to html :" $doc.FullName
  saveashtml
  $doc = $null
}

$wordApp.quit();

This successfully converts the file but not in UTF-8 format as seen in the meta tag.

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

Special characters are displayed as � in the HTML file.

How can I fix this?

cfoster5
What OS? What version of PowerShell? What version of Word? What browser and browser version? Put the answers to these questions in your question. – somebadhat Jan 25 '20 at 00:30

1 Answer


Windows 10 64-bit. PowerShell 5.1 and 7rc.1

Use PowerShell to convert Microsoft Word documents to HTML 4 / 5 documents.

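HTML 4 and 5 documents should be saved using the UTF-8 character encoding. In Windows PowerShell 5.1 the default output encoding depends on the cmdlet: Out-File and > write UTF-16LE, while Set-Content writes the system ANSI code page. PowerShell 6 and later default to UTF-8 without a BOM. <meta http-equiv=Content-Type content="text/html; charset=windows-1252"> only declares an encoding; changing it does not change the encoding the file is actually saved in, so the declaration and the saved bytes must agree.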
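To illustrate why the declaration and the bytes have to agree, here is a small sketch using only .NET's built-in UTF-8 encoder. The variable names are made up for the example. It shows that é is one byte in windows-1252 and two bytes in UTF-8, and that reading the windows-1252 byte as UTF-8 produces the replacement character (the � seen in the question).

```powershell
# The character é (U+00E9) is the single byte 0xE9 in windows-1252
# and the two bytes 0xC3 0xA9 in UTF-8.
$eAcute      = [string][char]0xE9
$cp1252Bytes = [byte[]]@(0xE9)
$utf8Bytes   = [Text.Encoding]::UTF8.GetBytes($eAcute)
$utf8Hex     = ($utf8Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '

# A lone 0xE9 is not valid UTF-8, so decoding it as UTF-8 yields U+FFFD,
# the replacement character that browsers draw as a question-mark diamond.
$misread = [Text.Encoding]::UTF8.GetString($cp1252Bytes)
$misreadHasReplacement = $misread.Contains([string][char]0xFFFD)

"UTF-8 bytes for e-acute: $utf8Hex"
"windows-1252 byte read as UTF-8 contains U+FFFD: $misreadHasReplacement"
```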

You have at least three jobs:

  1. Replace charset=windows-1252 with charset=UTF-8
  2. Save your documents using UTF-8 character encoding format.
  3. Check your output for errors.
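For the third job, one possible check is to decode each output file with a strict UTF-8 decoder and see whether it throws. This is a minimal sketch, not part of the original steps; Test-ValidUtf8 is a hypothetical function name, and it only tells you whether the bytes are well-formed UTF-8, not whether the text is what you intended.

```powershell
# Hypothetical helper: returns $true when the file's bytes are well-formed UTF-8.
function Test-ValidUtf8 {
    param([Parameter(Mandatory)][string]$Path)

    $bytes = [IO.File]::ReadAllBytes($Path)
    # UTF8Encoding(encoderShouldEmitUTF8Identifier, throwOnInvalidBytes):
    # the second argument makes decoding throw instead of substituting U+FFFD.
    $strictUtf8 = New-Object System.Text.UTF8Encoding($false, $true)
    try {
        [void]$strictUtf8.GetString($bytes)
        return $true
    }
    catch {
        return $false
    }
}
```

Pass a full path, for example Test-ValidUtf8 -Path "$env:userprofile\Desktop\output.html", because .NET file methods resolve relative paths against the process directory rather than the current PowerShell location.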

Use your conversion script of choice. I like Thomas Stensitzki's Convert-WordDocument.ps1 for converting Word documents with PowerShell. Like most conversion scripts, it requires either Apache OpenOffice (about v4.1.7) or Microsoft Word (possibly version 12; Thomas says Word 16) to be installed locally. It converts a 5 MB Word 2003 document with 16 images to HTML in under twelve seconds.

Change your http-equiv meta element if necessary:

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

to

<meta http-equiv=Content-Type content="text/html; charset=UTF-8"> for HTML 4 documents 

or

<meta http-equiv=Content-Type content="text/html; charset=windows-1252">

to

<meta charset="UTF-8"> for HTML 5 documents.

A sitemap I created on 01/24/20 at xml-sitemaps.com used both.

<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta charset="utf-8">

Save / Create the document using the UTF-8 character encoding format.

What works in PowerShell 5.1 might be easier in PowerShell 6 or later, because later versions of PowerShell default to UTF-8 when writing files. Read the links below.

PowerShell 5.1:

# without overwriting. Get-Content reads the file with the system ANSI code page (windows-1252 on Western-language systems); WriteAllLines writes UTF-8 without a BOM.
$source = (gc $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8"
$output = "$env:userprofile\Desktop\output.html"
[IO.File]::WriteAllLines($output, $source)

PowerShell 7rc.1:

# without overwriting. -Encoding 1252 reads the Word output as windows-1252 (Get-Content would otherwise assume UTF-8); Out-File writes UTF-8 without a BOM.
(gc -Encoding 1252 $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8" | out-file -force $env:userprofile\Desktop\output.html
# with overwriting. Same encodings as above.
(gc -Encoding 1252 $env:userprofile\Desktop\source.html) -replace "charset=windows-1252", "charset=UTF-8" | out-file -force $env:userprofile\Desktop\source.html

Batch convert with PowerShell 7rc.1:

# with overwriting. Reads each file as windows-1252 and writes it back as UTF-8 without a BOM.
foreach ($i in ls -name "$env:userprofile\Desktop\*.html")
{
    (gc -Encoding 1252 "$env:userprofile\Desktop\$i") -replace "charset=windows-1252", "charset=UTF-8" | out-file -force "$env:userprofile\Desktop\$i"
}
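A batch variant that does not depend on each PowerShell version's default encodings can be sketched with .NET file methods. This is an illustration under assumptions: Convert-HtmlToUtf8 is a hypothetical function name, the input files are assumed to be windows-1252 as Word declares them, and files that no longer contain charset=windows-1252 are skipped so that running it twice does not re-encode them.

```powershell
# Hypothetical helper: re-encode Word's windows-1252 HTML output as UTF-8 in place.
function Convert-HtmlToUtf8 {
    param([Parameter(Mandatory)][string]$Folder)

    $cp1252 = [Text.Encoding]::GetEncoding(1252)
    foreach ($file in Get-ChildItem -Path $Folder -Filter '*.html') {
        # Decode the bytes explicitly as windows-1252.
        $html = [IO.File]::ReadAllText($file.FullName, $cp1252)
        # Skip files that were already converted.
        if ($html -notmatch 'charset=windows-1252') { continue }
        $html = $html -replace 'charset=windows-1252', 'charset=UTF-8'
        # WriteAllText without an encoding argument writes UTF-8 without a BOM.
        [IO.File]::WriteAllText($file.FullName, $html)
    }
}
```

Usage would look like Convert-HtmlToUtf8 -Folder "$env:userprofile\Desktop\TestHTMLDocs". Work on a copy of the folder first, since the files are overwritten.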

That should display your special characters correctly.
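Another route, not one of the steps tested above, is to have Word write UTF-8 in the first place. Word's object model exposes a WebOptions.Encoding property on the document, and 65001 is the msoEncodingUTF8 value. A minimal sketch, assuming Microsoft Word is installed and the document is opened through the COM object as in the question; Save-AsUtf8Html is a hypothetical helper name:

```powershell
# Hypothetical helper; $Document is a Word document object such as the one
# returned by $wordApp.Documents.Open(...) in the question's script.
function Save-AsUtf8Html {
    param(
        [Parameter(Mandatory)]$Document,
        [Parameter(Mandatory)][string]$Path
    )

    $msoEncodingUTF8 = 65001   # MsoEncoding value for UTF-8
    $wdFormatHTML    = 8       # WdSaveFormat value for HTML

    # Tell Word which encoding to use when this document is saved as a web page.
    $Document.WebOptions.Encoding = $msoEncodingUTF8
    $Document.SaveAs([ref]$Path, [ref]$wdFormatHTML)
}

# Usage inside the question's loop (requires Word, so left commented out):
# $openDoc = $wordApp.Documents.Open($doc.FullName)
# Save-AsUtf8Html -Document $openDoc -Path "$htmlOutputPath\$($doc.BaseName).html"
# $openDoc.Close()
```

If this works for your version of Word, the meta element and the saved bytes should both be UTF-8 and no second pass over the files is needed.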

Understanding file encoding

HTML Charset - W3Schools

Declaring character encodings in HTML

HTML http-equiv Attribute

Using PowerShell to write a file in UTF-8 without the BOM

Understanding file encoding2

Understand default encoding and change the same in PowerShell

To check which version of PowerShell you have, run $PSVersionTable.PSVersion

Always declare the encoding of your document using a meta element with a charset attribute. The declaration should fit completely within the first 1024 bytes of the file, so it's best to put it immediately after the opening head tag. To view the first 1024 bytes of an .html file on Windows 10 64-bit, download http://unxutils.sourceforge.net/UnxUpdates.zip and run head -c 1024 myfilenamehere.html
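The same check can also be sketched in PowerShell itself, without extra downloads. Test-CharsetInFirst1024Bytes is a hypothetical function name, and the regular expression is a loose match for either meta form shown earlier, not a full HTML parser.

```powershell
# Hypothetical helper: $true when a UTF-8 charset declaration appears
# within the first 1024 bytes of the file.
function Test-CharsetInFirst1024Bytes {
    param([Parameter(Mandatory)][string]$Path)

    $bytes = [IO.File]::ReadAllBytes($Path)
    $count = [Math]::Min(1024, $bytes.Length)
    # ASCII decoding never throws; non-ASCII bytes become '?', which is fine
    # because the declaration itself is plain ASCII.
    $head = [Text.Encoding]::ASCII.GetString($bytes, 0, $count)
    # -match is case-insensitive, so UTF-8 and utf-8 both match.
    return $head -match '<meta[^>]*charset\s*=\s*["'']?utf-8'
}
```

For example, Test-CharsetInFirst1024Bytes -Path "$env:userprofile\Desktop\output.html" returns $false if the declaration is missing, still says windows-1252, or starts after byte 1024.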

None of the following worked for me, but they are worth reading.

Changing PowerShell's default output encoding to UTF-8

Changing source files encoding and some fun with PowerShell

Convert Word documents using PowerShell

How to convert a word document to other formats using PowerShell

Saving Word document as HTML

Convert word document to text file using powershell

somebadhat