Convert Word doc and docx format to PDF in .NET Core without Microsoft.Office.Interop

Question

I need to display Word .doc and .docx files in a browser. There's no real client-side way to do this, and these documents can't be shared with Google Docs or Microsoft Office 365 for legal reasons.

Browsers can't display Word, but can display PDF, so I want to convert these docs to PDF on the server and then display that.

I know this can be done using Microsoft.Office.Interop.Word, but my application is .NET Core and does not have access to Office interop. It could be running on Azure, but it could also be running in a Docker container on anything else.

There appear to be lots of similar questions to this, but most either ask about full-framework .NET or assume that the server runs Windows, so their answers are no use to me.

How do I convert .doc and .docx files to .pdf without access to Microsoft.Office.Interop.Word?

It's like asking to convert from Word to PDF without the help of Microsoft. It's theoretically possible, but Word is such a huge application that, in the general case, it's practically impossible; Word is still the best for this. You could connect your core apps to an opaque dedicated Windows box exposing a conversion service (don't overlook licensing issues). Otherwise, if you restrict your conversion ambitions, there are some libraries that should help (aspose, itextsharp, etc.). Also, keep in mind that doc and docx are fundamentally very different formats and solutions may vary accordingly. — Simon Mourier, Oct 09 '17 at 08:19
@SimonMourier `docx` is (supposedly) an open format (Microsoft pushed for ages on that) but it is fairly awful - under the hood it's just a load of xml files in a zip. `doc` is binary, but also pretty much unchanged for 20 years and lots of parsers for the format are already out there. Office has always been a desktop app and an expensive liability on servers, I can't be the first/only person to ask for this. — Keith, Oct 09 '17 at 08:34
@SimonMourier I've used Aspose before - my team wasn't that impressed, it's crushingly expensive for what it does and it's full fat .NET, so no use here anyway. iText is good for PDF manipulation, but it's also expensive when there are plenty of PDF API that are open source. — Keith, Oct 09 '17 at 08:53
Well, looks like you have all the answers already; Indeed, you're not the only one looking for the Holy Grail :-) — Simon Mourier, Oct 09 '17 at 09:32
@SimonMourier I wouldn't have put a 500 bounty on it if I hadn't thought it was a damn nasty problem :-) — Keith, Oct 09 '17 at 09:33
I don't really get the problem. There are a lot of open source implementation for these formats. You could for example get a libreoffice binary and run `soffice --convert-to pdf --nologo name.docx` and you would have a pdf file. — Shmuel H., Oct 09 '17 at 11:59
@ShmuelH. I could indeed - you make it _sound_ easy. Why not go to the extra effort of putting that in an answer; you can't earn rep or bounties for comments. — Keith, Oct 12 '17 at 06:42
@Keith I wanted to make sure I'm not missing something. Thanks. — Shmuel H., Oct 12 '17 at 08:18
LibreOffice is terrible for accuracy. It's only when a customer complains that you realise this. I have seen a dozen converters and they all fail compared to just saving as PDF in Word. The thousands of dollars third parties charge is pathetic for the accuracy they give. Completely unusable. — Luke, Jul 04 '20 at 15:13

score 98 · Accepted Answer · edited Aug 22 '20 at 14:12

This was such a pain, no wonder all the third party solutions are charging $500 per developer.

The good news is that the Open XML SDK recently added support for .NET Standard, so it looks like you're in luck with the .docx format.

The bad news is that, at the moment, there isn't a lot of choice for PDF generation libraries on .NET Core. Since it doesn't look like you want to pay for one and you can't legally use a third-party service, we have little choice except to roll our own.

The main problem is getting the Word document content transformed to PDF. One of the popular ways is converting the Docx to HTML and exporting that to PDF. It was hard to find, but there is a .NET Core version of the OpenXMLSDK-PowerTools that supports transforming Docx to HTML. The Pull Request is "about to be accepted"; you can get it from here:

https://github.com/OfficeDev/Open-Xml-PowerTools/tree/abfbaac510d0d60e2f492503c60ef897247716cf

Now that we can extract document content to HTML we need to convert it to PDF. There are a few libraries to convert HTML to PDF, for example DinkToPdf is a cross-platform wrapper around the Webkit HTML to PDF library libwkhtmltox.

I thought DinkToPdf was better than https://code.msdn.microsoft.com/How-to-export-HTML-to-PDF-c5afd0ce

Docx to HTML

Let's put this all together. Download the OpenXMLSDK-PowerTools .NET Core project and build it (just the OpenXMLPowerTools.Core and the OpenXMLPowerTools.Core.Example projects; ignore the other project). Set OpenXMLPowerTools.Core.Example as the StartUp project. Run the console project:

static void Main(string[] args)
{
    // Open the .docx as a package; both handles are disposed when the block ends.
    using (var source = Package.Open(@"test.docx"))
    using (var document = WordprocessingDocument.Open(source))
    {
        HtmlConverterSettings settings = new HtmlConverterSettings();
        XElement html = HtmlConverter.ConvertToHtml(document, settings);

        Console.WriteLine(html.ToString());
        // Save the generated HTML next to the executable.
        using (var writer = File.CreateText("test.html"))
        {
            writer.WriteLine(html.ToString());
        }
    }
    Console.ReadLine();
}

Make sure the test.docx is a valid word document with some text otherwise you might get an error:

the specified package is invalid. the main part is missing
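
A cheap pre-check can catch this case before the file ever reaches the Open XML SDK. The following is only a sketch: the class and method names are hypothetical, and it assumes the main part sits at the conventional path `word/document.xml` (the package relationships could in principle point elsewhere, so treat a negative result as a hint rather than proof).

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper, not part of the Open XML SDK or PowerTools.
public static class DocxPreCheck
{
    // Returns true when the file opens as a ZIP archive and contains the
    // conventional main document part, "word/document.xml".
    public static bool LooksLikeDocx(string path)
    {
        try
        {
            using (ZipArchive zip = ZipFile.OpenRead(path))
            {
                return zip.Entries.Any(e =>
                    string.Equals(e.FullName, "word/document.xml",
                                  StringComparison.OrdinalIgnoreCase));
            }
        }
        catch (InvalidDataException)
        {
            return false; // not a ZIP archive at all (e.g. a truncated copy)
        }
        catch (IOException)
        {
            return false; // missing, locked, or otherwise unreadable
        }
    }
}
```

Calling `DocxPreCheck.LooksLikeDocx("test.docx")` before `Package.Open` lets you report a clearer message than the SDK's exception.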

If you run the project you will see the HTML looks almost exactly like the content in the Word document:

However if you try a Word Document with pictures or links you will notice they're missing or broken.

This CodeProject article addresses these issues: https://www.codeproject.com/Articles/1162184/Csharp-Docx-to-HTML-to-Docx

I had to change the static Uri FixUri(string brokenUri) method to return a Uri and I added user friendly error messages.

static void Main(string[] args)
{
    var fileInfo = new FileInfo(@"c:\temp\MyDocWithImages.docx");
    string fullFilePath = fileInfo.FullName;
    string htmlText = string.Empty;
    try
    {
        htmlText = ParseDOCX(fileInfo);
    }
    catch (OpenXmlPackageException e)
    {
        if (e.ToString().Contains("Invalid Hyperlink"))
        {
            using (FileStream fs = new FileStream(fullFilePath,FileMode.OpenOrCreate, FileAccess.ReadWrite))
            {
                UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
            }
            htmlText = ParseDOCX(fileInfo);
        }
    }

    using (var writer = File.CreateText("test1.html"))
    {
        writer.WriteLine(htmlText);
    }
}
        
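public static Uri FixUri(string brokenUri)
{
    // For mailto links, drop the scheme and keep the remainder.
    string candidate = brokenUri.Contains("mailto:")
        ? brokenUri.Remove(0, "mailto:".Length)
        : string.Empty;

    // new Uri(...) throws UriFormatException for blank or relative strings,
    // so fall back to a placeholder link when the candidate cannot be parsed.
    Uri fixedUri;
    if (Uri.TryCreate(candidate, UriKind.Absolute, out fixedUri))
    {
        return fixedUri;
    }
    return new Uri("http://broken-link/");
}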

public static string ParseDOCX(FileInfo fileInfo)
{
    try
    {
        byte[] byteArray = File.ReadAllBytes(fileInfo.FullName);
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(byteArray, 0, byteArray.Length);
            using (WordprocessingDocument wDoc =
                                        WordprocessingDocument.Open(memoryStream, true))
            {
                int imageCounter = 0;
                var pageTitle = fileInfo.FullName;
                var part = wDoc.CoreFilePropertiesPart;
                if (part != null)
                    pageTitle = (string)part.GetXDocument()
                                            .Descendants(DC.title)
                                            .FirstOrDefault() ?? fileInfo.FullName;

                WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                {
                    AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                    PageTitle = pageTitle,
                    FabricateCssClasses = true,
                    CssClassPrefix = "pt-",
                    RestrictToSupportedLanguages = false,
                    RestrictToSupportedNumberingFormats = false,
                    ImageHandler = imageInfo =>
                    {
                        ++imageCounter;
                        string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                        ImageFormat imageFormat = null;
                        if (extension == "png") imageFormat = ImageFormat.Png;
                        else if (extension == "gif") imageFormat = ImageFormat.Gif;
                        else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                        else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                        else if (extension == "tiff")
                        {
                            extension = "gif";
                            imageFormat = ImageFormat.Gif;
                        }
                        else if (extension == "x-wmf")
                        {
                            extension = "wmf";
                            imageFormat = ImageFormat.Wmf;
                        }

                        if (imageFormat == null) return null;

                        string base64 = null;
                        try
                        {
                            using (MemoryStream ms = new MemoryStream())
                            {
                                imageInfo.Bitmap.Save(ms, imageFormat);
                                var ba = ms.ToArray();
                                base64 = System.Convert.ToBase64String(ba);
                            }
                        }
                        catch (System.Runtime.InteropServices.ExternalException)
                        { return null; }

                        ImageFormat format = imageInfo.Bitmap.RawFormat;
                        ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
                                                    .First(c => c.FormatID == format.Guid);
                        string mimeType = codec.MimeType;

                        string imageSource =
                                string.Format("data:{0};base64,{1}", mimeType, base64);

                        XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageSource),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                        return img;
                    }
                };

                XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                var html = new XDocument(new XDocumentType("html", null, null, null),
                                                                            htmlElement);
                var htmlString = html.ToString(SaveOptions.DisableFormatting);
                return htmlString;
            }
        }
    }
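    catch (Exception e) when (!(e is OpenXmlPackageException))
    {
        // OpenXmlPackageException is left to propagate so that Main can repair invalid hyperlinks.
        return "The file is either open (please close it) or contains corrupt data.";
    }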
}

You may need the System.Drawing.Common NuGet package to use ImageFormat.

Now we can get images:

If you only want to show Word .docx files in a web browser, it's better not to convert the HTML to PDF, as that will significantly increase bandwidth. You could store the HTML in a file system, in the cloud, or in a database using a VPP technology.

HTML to PDF

The next thing we need to do is pass the HTML to DinkToPdf. Download the DinkToPdf (90 MB) solution. Build the solution; it will take a while for all the packages to be restored and for the solution to compile.

IMPORTANT:

The DinkToPdf library requires the libwkhtmltox.so and libwkhtmltox.dll file in the root of your project if you want to run on Linux and Windows. There's also a libwkhtmltox.dylib file for Mac if you need it.

These native libraries are in the v0.12.4 folder. Depending on whether your PC is 32-bit or 64-bit, copy the 3 files to the DinkToPdf-master\DinkToPdf.TestConsoleApp\bin\Debug\netcoreapp1.1 folder.
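
A missing native library usually only surfaces as a load failure at conversion time, so it can help to check for it at startup. This is a minimal sketch using only the base class library; the class and method names are hypothetical, and it assumes the three file names listed above are the ones your build of the wrapper expects.

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

// Hypothetical startup check, not part of DinkToPdf.
public static class NativeLibCheck
{
    // Picks the libwkhtmltox file name that matches the current OS.
    public static string ExpectedLibraryName()
    {
        if (RuntimeInformation.IsOSPlatform(OSPlatform.Windows)) return "libwkhtmltox.dll";
        if (RuntimeInformation.IsOSPlatform(OSPlatform.OSX)) return "libwkhtmltox.dylib";
        return "libwkhtmltox.so"; // Linux and other Unix-like systems
    }

    // True when the expected library file exists in the given directory.
    public static bool IsPresent(string directory)
    {
        return File.Exists(Path.Combine(directory, ExpectedLibraryName()));
    }
}
```

For example, `NativeLibCheck.IsPresent(AppContext.BaseDirectory)` checks the folder the application is running from. It only verifies that the file is there, not that it matches the process bitness.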

IMPORTANT 2:

Make sure that you have libgdiplus installed in your Docker image or on your Linux machine. The libwkhtmltox.so library depends on it.

Set DinkToPdf.TestConsoleApp as the StartUp project and change the Program.cs file to read the htmlContent from the HTML file saved with Open-Xml-PowerTools instead of the Lorem Ipsum text.

var doc = new HtmlToPdfDocument()
{
    GlobalSettings = {
        ColorMode = ColorMode.Color,
        Orientation = Orientation.Landscape,
        PaperSize = PaperKind.A4,
    },
    Objects = {
        new ObjectSettings() {
            PagesCount = true,
            HtmlContent = File.ReadAllText(@"C:\TFS\Sandbox\Open-Xml-PowerTools-abfbaac510d0d60e2f492503c60ef897247716cf\ToolsTest\test1.html"),
            WebSettings = { DefaultEncoding = "utf-8" },
            HeaderSettings = { FontSize = 9, Right = "Page [page] of [toPage]", Line = true },
            FooterSettings = { FontSize = 9, Right = "Page [page] of [toPage]" }
        }
    }
};

The result of the Docx vs the PDF is quite impressive and I doubt many people would pick out many differences (especially if they never see the original):

P.S. I realise you wanted to convert both .doc and .docx to PDF. I'd suggest making a service yourself to convert .doc to .docx using a specific non-server Windows/Microsoft technology. The .doc format is binary and is not intended for server-side automation of Office.
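
Because the two formats need different handling, a server may want to route uploads by content rather than by file extension. The sketch below is an assumption-laden illustration (the type and method names are hypothetical): it relies on .doc files being OLE compound files and .docx files being ZIP packages, and it only identifies the container. Other Office formats share the same signatures, so a match does not prove the file is a Word document.

```csharp
using System;
using System.IO;

// Hypothetical names; identifies the container type only.
public enum ContainerKind { Unknown, OleCompoundFile, ZipPackage }

public static class WordFormatSniffer
{
    // Signature of OLE compound files (used by binary .doc).
    private static readonly byte[] OleSignature =
        { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    // Local file header signature of ZIP archives (used by .docx).
    private static readonly byte[] ZipSignature = { 0x50, 0x4B, 0x03, 0x04 };

    public static ContainerKind Detect(Stream stream)
    {
        // Read up to 8 bytes; Stream.Read may return fewer than requested.
        var header = new byte[8];
        int read = 0;
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }
        if (StartsWith(header, read, OleSignature)) return ContainerKind.OleCompoundFile;
        if (StartsWith(header, read, ZipSignature)) return ContainerKind.ZipPackage;
        return ContainerKind.Unknown;
    }

    private static bool StartsWith(byte[] buffer, int length, byte[] signature)
    {
        if (length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (buffer[i] != signature[i]) return false;
        }
        return true;
    }
}
```

A caller could send `OleCompoundFile` results to a .doc-to-.docx step first and pass `ZipPackage` results straight to the Docx-to-HTML code.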

Why not pipe it through an application that converts for you? — Mardoxx, Oct 10 '17 at 13:01
Cheers, excellent answer. I think I might have the last piece of the puzzle as I found an open source .NET Mono `doc` > `docx` converter that can be [ported to .NET Core](https://github.com/EvolutionJobs/b2xtranslator-core). — Keith, Oct 11 '17 at 06:27
Have you looked at 2nd Point in design consideration? https://github.com/OfficeDev/office-content/blob/master/en-us/OpenXMLCon/articles/43c49a6d-96b5-4e87-a5bf-01629d61aad4.md , I guess MSFT does not want you to use this for generating HTML or PDF — Akash Kava, Oct 12 '17 at 05:56
@AkashKava that's because Open XML SDK **PowerTools** provides the functionality to convert Open XML formats to and from other formats, such as HTML. — Jeremy Thompson, Oct 12 '17 at 06:07
@JeremyThompson I've gotten [`b2xtranslator`](https://github.com/EvolutionJobs/b2xtranslator-core) up and running in .NET Core, switched from the dedicated ZIP implementation to `System.IO.Compression` and fixed the weird command line tests to just use NUnit. It's still not quite there - working on getting all unit tests to pass and adding new to cover more use-cases/code. Looking for contributors if you (or anyone) are interested. — Keith, Oct 12 '17 at 06:19
How do you use ImageFormat, as System.Drawing is not supported in .net core? — Boris Lipschitz, May 22 '18 at 23:50
Sure. In your answer, you provided a function ParseDOCX() where you use ImageFormat. What library should I reference in order to use it? — Boris Lipschitz, May 23 '18 at 00:55
I'm having trouble with text boxes and "behind text" images. Its a one big mess. Text box are not showing at all — Guy Biber, Oct 17 '18 at 20:21
`.doc` to `.docx` can be converted using `Microsoft.Office.Interop.Word.dll` if you have Office installed on the server: https://stackoverflow.com/questions/34111015/convert-doc-to-docx-using-c-sharp — vapcguy, Nov 30 '18 at 23:31
@vapcguy - you didn't read the question. OP specifically can't install Office on Linux and KB257757 says office automation server side is unsupported. — Jeremy Thompson, Dec 01 '18 at 02:56
@JeremyThompson Yeah, you're right... :facepalm: Missed that part. — vapcguy, Dec 03 '18 at 15:54
I was getting exception converting docx to html. Changing line to `Package.Open(stream,FileMode.OpenOrCreate);` helped — alex kostin, Jul 15 '19 at 09:24
How are fonts handled? Specifically, fonts which aren't part of the standard PDF fonts. `HtmlConverter.ConvertToHtml` provides the font names, but I haven't been able to determine if it's possible to get the corresponding font file, as that will need to be embedded inside the PDF if it is not standard. — jfizz, Jul 19 '19 at 19:04
@jfizz you could try to load a CSS with Fonts however afaik you can't embed fonts in a PDF the client OS has to have the font installed. See https://wkhtmltopdf.org/libwkhtmltox/pagesettings.html — Jeremy Thompson, Jul 19 '19 at 22:36
@JeremyThompson embedding fonts in PDFs is possible. For example, with PDFium: https://github.com/ArgusMagnus/PDFiumSharp/wiki/M_PDFiumSharp_PDFium_FPDFText_LoadFont. It is also possible to embed in Word docs. However, there could possibly be licensing issues with embedding which might be why no font files are given. — jfizz, Jul 19 '19 at 22:48
@jfizz **`HtmlConverter.ConvertToHtml` provides the font names, but I haven't been able to determine if it's possible to get the corresponding font file, as that will need to be embedded inside the PDF if it is not standard.** - I'm interested in trying, maybe a new question? — Jeremy Thompson, Jul 20 '19 at 03:18
@BorisLipschitz I know this is old, but for the benefit of anyone else wondering, you install the System.Drawing.Common NuGet package to use ImageFormat — Quails4Eva, Sep 04 '19 at 11:44
@JeremyThompson Cool. Note, I think this is only required when targeting .Net Core, the namespace is probably already included in the full .Net — Quails4Eva, Sep 05 '19 at 08:39

Shmuel H. · Answer 2 · 2017-10-12T12:32:38.090

Using the LibreOffice binary

The LibreOffice project is an open-source, cross-platform alternative to MS Office. We can use its capabilities to export doc and docx files to PDF. Currently, LibreOffice has no official API for .NET; therefore, we will talk directly to the soffice binary.

It is a somewhat "hacky" solution, but I think it is the one with the fewest bugs and the lowest maintenance cost. Another advantage of this method is that you are not restricted to converting from doc and docx: you can convert from every format LibreOffice supports (e.g. odt, html, spreadsheets, and more).

The implementation

I wrote a simple C# program that uses the soffice binary. This is just a proof of concept (and my first program in C#). It supports Windows out of the box, and Linux only if the LibreOffice package has been installed.

This is main.cs:

using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.Reflection;

namespace DocToPdf
{
    public class LibreOfficeFailedException : Exception
    {
        // string.Format needs an indexed placeholder ({0}); a bare {} throws FormatException.
        public LibreOfficeFailedException(int exitCode)
            : base(string.Format("LibreOffice has failed with {0}", exitCode))
            {}
    }

    class Program
    {
        static string getLibreOfficePath() {
            switch (Environment.OSVersion.Platform) {
                case PlatformID.Unix:
                    return "/usr/bin/soffice";
                case PlatformID.Win32NT:
                    string binaryDirectory = System.IO.Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location);
                    return binaryDirectory + "\\Windows\\program\\soffice.exe";
                default:
                    throw new PlatformNotSupportedException ("Your OS is not supported");
            }
        }

        static void Main(string[] args) {
            string libreOfficePath = getLibreOfficePath();

            // Quote the file name so that paths containing spaces reach soffice as one argument.
            // Quoting alone does not make untrusted input safe: validate user-supplied names first.
            ProcessStartInfo procStartInfo = new ProcessStartInfo(libreOfficePath, string.Format("--convert-to pdf --nologo \"{0}\"", args[0]));
            procStartInfo.RedirectStandardOutput = true;
            procStartInfo.UseShellExecute = false;
            procStartInfo.CreateNoWindow = true;
            procStartInfo.WorkingDirectory = Environment.CurrentDirectory;

            Process process = new Process() { StartInfo = procStartInfo };
            process.Start();
            // Drain the redirected output so a full pipe cannot block soffice.
            process.StandardOutput.ReadToEnd();
            process.WaitForExit();

            // Check for failed exit code.
            if (process.ExitCode != 0) {
                throw new LibreOfficeFailedException(process.ExitCode);
            }
        }
    }
}
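
Passing a file name to soffice is the fragile part of this approach, especially when the name comes from a user. One possible way to handle it, sketched here under stated assumptions, is to let the runtime do the escaping through `ProcessStartInfo.ArgumentList` (available from .NET Core 2.1) and to accept only expected extensions. The class name, the allow-list, and the use of `--headless` and `--outdir` are choices made for this illustration rather than part of the program above.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper: builds the soffice invocation without starting it.
public static class SofficeCommand
{
    private static readonly string[] AllowedExtensions = { ".doc", ".docx" };

    public static ProcessStartInfo Build(string sofficePath, string inputFile, string outputDirectory)
    {
        // An absolute path cannot be mistaken for a command-line option.
        string fullPath = Path.GetFullPath(inputFile);
        string extension = Path.GetExtension(fullPath).ToLowerInvariant();
        if (Array.IndexOf(AllowedExtensions, extension) < 0)
        {
            throw new ArgumentException("Only .doc and .docx files are accepted.", nameof(inputFile));
        }

        var startInfo = new ProcessStartInfo(sofficePath)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };
        // Each entry is passed as one argument; the runtime handles quoting.
        startInfo.ArgumentList.Add("--headless");
        startInfo.ArgumentList.Add("--convert-to");
        startInfo.ArgumentList.Add("pdf");
        startInfo.ArgumentList.Add("--outdir");
        startInfo.ArgumentList.Add(outputDirectory);
        startInfo.ArgumentList.Add(fullPath);
        return startInfo;
    }
}
```

The returned object can be handed to `Process.Start`. Separating construction from execution also makes the argument handling easy to unit test without LibreOffice installed.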

Resources

The project repository: Example of a package including the Windows LibreOffice binary.

Results

I tested it on Arch Linux, compiled with Mono. I ran it using Mono with the Linux binary, and under Wine with the Windows binary.

You can find the results in the Tests directory:

Input files: testdoc.doc, testdocx.docx

Outputs:

Wine: testdoc, testdocx.
Mono: testdoc, testdocx.

Note that libreoffice will not be able to properly convert office documents that use proprietary fonts (had the Issue with verdana If I recall correctly) unless they are installed on the OS. Other than this font issue, didn't have much issue with it. — Herz3h, Dec 03 '20 at 10:52
Just be super careful with handling the filename if it is a user-input value. This can lead to code execution on your server. — Timothy Leung, Feb 27 '21 at 04:52

score 11 · Answer 3 · edited Jan 30 '20 at 21:55

I've recently done this with FreeSpire.Doc. It has a limit of 3 pages for the free version but it can easily convert a docx file into PDF using something like this:

private void ConvertToPdf()
{
    try
    {
        for (int i = 0; i < listOfDocx.Count; i++)
        {
            CurrentModalText = "Converting To PDF";
            CurrentLoadingNum += 1;

            string savePath = PdfTempStorage + i + ".pdf";
            listOfPDF.Add(savePath);

            Spire.Doc.Document document = new Spire.Doc.Document(listOfDocx[i], FileFormat.Auto);
            document.SaveToFile(savePath, FileFormat.PDF);
        }
    }
    catch (Exception)
    {
        // Rethrow with "throw;" so the original stack trace is preserved.
        throw;
    }
}

I then sew these individual PDFs together later using iTextSharp.pdf:

public static byte[] concatAndAddContent(List<byte[]> pdfByteContent, List<MailComm> localList)
{
    using (var ms = new MemoryStream())
    {
        using (var doc = new Document())
        {
            using (var copy = new PdfSmartCopy(doc, ms))
            {
                doc.Open();
                // add checklist at the start
                using (var db = new StudyContext())
                {
                    var contentId = localList[0].ContentID;
                    var temp = db.MailContentTypes.Where(x => x.ContentId == contentId).ToList();
                    if (!temp[0].Code.Equals("LAB"))
                    {
                        pdfByteContent.Insert(0, CheckListCreation.createCheckBox(localList));
                    }
                }

                // Loop through each byte array
                foreach (var p in pdfByteContent)
                {
                    // Create a PdfReader bound to that byte array
                    using (var reader = new PdfReader(p))
                    {
                        // Add the entire document instead of page-by-page
                        copy.AddDocument(reader);
                    }
                }

                doc.Close();
            }
        }

        // Return just before disposing
        return ms.ToArray();
    }
}

I don't know if this suits your use case, as you haven't specified the size of the documents you're trying to convert, but if they're 3 pages or fewer, or you can manipulate them to fit within 3 pages, it will allow you to convert them into PDFs.

As mentioned in the comments below, it is also unable to help with RTL languages, thank you @Aria for pointing that out.

edited Jan 30 '20 at 21:55 by CarenRose
answered Sep 01 '18 at 10:39 by Bomie

Just to clarify because you didn't mention it. "Spire.Doc" leaves a red "warning evaluation" watermark at the top of the converted PDF. When searching on Nuget, look for "FreeSpire.Doc", this version does not contain the watermark. Nice API, this should be marked as the answer imo. – user3180664 Jan 07 '19 at 21:34
Yeah, that's what I did, sorry i should of been more specific. Hopefully this answer helped you out a little! – Bomie Jan 08 '19 at 08:03
I'm using FreeSpire.Doc and still getting the eval warning. – grinder22 Feb 27 '19 at 01:09
There are some problems with free that I recently tested, 1- Main problem is about RTL document such as Persian, all my characters messed up and unreadable. 2- There is a signature in end of document. 3- It is too slow. – Aria Aug 13 '19 at 11:10
@Aria can you clarify what you mean by a signature? Are you referring to the warning if you go over 3 pages as i did mention that. I limit the documents i'm working with to 2 pages then sew them together when they need to be displayed to the user. – Bomie Aug 13 '19 at 11:53
@Bomie , Yes I mean that something like **Free version converting word documents to PDF files, you can only get the first 3 page of PDF file.**, but that is no problem for me because all my documents are less than 3 pages, the problem is messing up all my character(RTL), is there workaround to resolve the first ? – Aria Aug 13 '19 at 12:14
@Aria I'm unsure if the characters are what's causing the issue or the RTL for you, but i've just used words align right and it seems to work fine as does copy and pasting some persian characters ا ب پ ت ث ج چ ح خ د ذ ر ز ژ س ش صض ط ظ ع غ ف ق ک گ ل م ن و ه ی which I found. I'd recommend trying it with a fresh document and also checking the word documents xml. There are some very useful programs floating around that allow you to do that – Bomie Aug 13 '19 at 12:33
@Bomie You know all my words used align right already and main document have RTL direction, your case worked because your document just contains character not words, so something like سلام will be س ل ا م, and if a line contains English words whole words and chars start LTR completely, anyway thanks for your advise and attentions. – Aria Aug 13 '19 at 12:57
@Aria So i genned a doc file in the actual langauge, I didnt fully understand what you meant so I apologise, but i was able to re-create the issue. It seems like it's a known issue on their forums ([link](https://www.e-iceblue.com/forum/problem-with-converting-word-documents-to-pdf-documents-t7901.html)). – Bomie Aug 13 '19 at 13:36
@Bomie Yes I see, I think the issue isn't fixed yet, There are some another library like `GemBox.Document` that can't convert RTL document properly, I tried to use [Word2Pdf.dll](https://www.c-sharpcorner.com/uploadfile/698727/convert-word-file-to-pdf-using-c-sharp/) it does work great on local but on server throw an exception as comments mentioned there, I tried to use `Microsoft.Office.Interop.Word` this is also need installed Office Word unfortunately. – Aria Aug 13 '19 at 14:06
@Aria These days `GemBox.Document` supports converting RTL documents to PDF, see [Right-to-Left Text](https://www.gemboxsoftware.com/document/examples/right-to-left-text/107) example. – Mario Z May 18 '20 at 02:19
@MarioZ Thanks for your comment, I already resolved the problem by installing Microsoft Word for those servers they need this feature but in future we may change it, I was enjoyed to have conversation with you. thank you for providing useful link. – Aria May 18 '20 at 08:20

score 3 · Answer 4 · answered Aug 07 '19 at 07:32

Sorry, I don't have enough reputation to comment, but I'd like to add my two cents on Jeremy Thompson's answer. I hope this helps someone.

When I was going through Jeremy Thompson's answer, after downloading OpenXMLSDK-PowerTools and running OpenXMLPowerTools.Core.Example, I got an error like

the specified package is invalid. the main part is missing

at the line

var document = WordprocessingDocument.Open(source);

After struggling for some hours, I found that the test.docx copied to the bin folder was only 1 KB. To solve this, right-click test.docx > Properties and set Copy to Output Directory to Copy always.

Hope this helps some novice like me :)

HappyGoLucky · Answer 5 · 2020-07-09T18:16:21.273

This is adding to Jeremy Thompson's very helpful answer. In addition to the word document body, I wanted the header (and footer) of the word document converted to HTML. I didn't want to modify the Open-Xml-PowerTools so I modified Main() and ParseDOCX() from Jeremy's example, and added two new functions. ParseDOCX now accepts a byte array so the original Word Docx isn't modified.

static void Main(string[] args)
{
    var fileInfo = new FileInfo(@"c:\temp\MyDocWithImages.docx");
    byte[] fileBytes = File.ReadAllBytes(fileInfo.FullName);
    string htmlText = string.Empty;
    string htmlHeader = string.Empty;
    try
    {
        htmlText = ParseDOCX(fileBytes, fileInfo.Name, false);
        htmlHeader = ParseDOCX(fileBytes, fileInfo.Name, true);
    }
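    catch (OpenXmlPackageException e)
    {
        if (e.ToString().Contains("Invalid Hyperlink"))
        {
            // fullFilePath was never declared in this version; use fileInfo.FullName instead.
            using (FileStream fs = new FileStream(fileInfo.FullName, FileMode.OpenOrCreate, FileAccess.ReadWrite))
            {
                UriFixer.FixInvalidUri(fs, brokenUri => FixUri(brokenUri));
            }
            // Reload the bytes so the retry parses the repaired file, not the stale copy.
            fileBytes = File.ReadAllBytes(fileInfo.FullName);
            htmlText = ParseDOCX(fileBytes, fileInfo.Name, false);
            htmlHeader = ParseDOCX(fileBytes, fileInfo.Name, true);
        }
    }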

    using (var writer = File.CreateText("test1.html"))
    {
        writer.WriteLine(htmlText);
    }
    using (var writer2 = File.CreateText("header1.html"))
    {
        writer2.WriteLine(htmlHeader);
    }
}

private static string ParseDOCX(byte[] fileBytes, string filename, bool headerOnly)
{
    try
    {
        using (MemoryStream memoryStream = new MemoryStream())
        {
            memoryStream.Write(fileBytes, 0, fileBytes.Length);
            using (WordprocessingDocument wDoc = WordprocessingDocument.Open(memoryStream, true))
            {
                int imageCounter = 0;
                var pageTitle = filename;
                var part = wDoc.CoreFilePropertiesPart;
                if (part != null)
                {
                    pageTitle = (string)part.GetXDocument()
                                            .Descendants(DC.title)
                                            .FirstOrDefault() ?? filename;
                }

                WmlToHtmlConverterSettings settings = new WmlToHtmlConverterSettings()
                {
                    AdditionalCss = "body { margin: 1cm auto; max-width: 20cm; padding: 0; }",
                    PageTitle = pageTitle,
                    FabricateCssClasses = true,
                    CssClassPrefix = "pt-",
                    RestrictToSupportedLanguages = false,
                    RestrictToSupportedNumberingFormats = false,
                    ImageHandler = imageInfo =>
                    {
                        ++imageCounter;
                        string extension = imageInfo.ContentType.Split('/')[1].ToLower();
                        ImageFormat imageFormat = null;
                        if (extension == "png") imageFormat = ImageFormat.Png;
                        else if (extension == "gif") imageFormat = ImageFormat.Gif;
                        else if (extension == "bmp") imageFormat = ImageFormat.Bmp;
                        else if (extension == "jpeg") imageFormat = ImageFormat.Jpeg;
                        else if (extension == "tiff")
                        {
                            extension = "gif";
                            imageFormat = ImageFormat.Gif;
                        }
                        else if (extension == "x-wmf")
                        {
                            extension = "wmf";
                            imageFormat = ImageFormat.Wmf;
                        }

                        if (imageFormat == null) return null;

                        string base64 = null;
                        try
                        {
                            using (MemoryStream ms = new MemoryStream())
                            {
                                imageInfo.Bitmap.Save(ms, imageFormat);
                                var ba = ms.ToArray();
                                base64 = System.Convert.ToBase64String(ba);
                            }
                        }
                        catch (System.Runtime.InteropServices.ExternalException)
                        { return null; }

                        ImageFormat format = imageInfo.Bitmap.RawFormat;
                        ImageCodecInfo codec = ImageCodecInfo.GetImageDecoders()
                                                    .First(c => c.FormatID == format.Guid);
                        string mimeType = codec.MimeType;

                        string imageSource =
                                string.Format("data:{0};base64,{1}", mimeType, base64);

                        XElement img = new XElement(Xhtml.img,
                                new XAttribute(NoNamespace.src, imageSource),
                                imageInfo.ImgStyleAttribute,
                                imageInfo.AltText != null ?
                                    new XAttribute(NoNamespace.alt, imageInfo.AltText) : null);
                        return img;
                    }
                };

                // Put header into document body, and remove everything else
                if (headerOnly)
                {
                    MoveHeaderToDocumentBody(wDoc);
                }

                XElement htmlElement = WmlToHtmlConverter.ConvertToHtml(wDoc, settings);
                var html = new XDocument(new XDocumentType("html", null, null, null),
                                                                            htmlElement);
                var htmlString = html.ToString(SaveOptions.DisableFormatting);
                return htmlString;
            }
        }
    }
    catch
    {
        return "The file is either open (please close it) or contains corrupt data";
    }
}

private static void MoveHeaderToDocumentBody(WordprocessingDocument wDoc)
{
    MainDocumentPart mainDocument = wDoc.MainDocumentPart;
    XElement docRoot = mainDocument.GetXDocument().Root;
    XElement body = docRoot.Descendants(W.body).First();
    // Only handles first header. Header info: https://docs.microsoft.com/en-us/office/open-xml/how-to-replace-the-header-in-a-word-processing-document
    HeaderPart header = mainDocument.HeaderParts.FirstOrDefault();
    if (header == null)
    {
        // No header part: empty the body (keeping section properties) instead of
        // throwing a NullReferenceException, so the "header" HTML comes out empty
        body.Elements().Where(el => el.Name != W.sectPr).Remove();
        return;
    }
    XElement headerRoot = header.GetXDocument().Root;

    // document body will contain the header content when we return from this function
    AddXElementToBody(headerRoot, body);
}

private static void AddXElementToBody(XElement sourceElement, XElement body)
{
    // Clone the children nodes
    List<XElement> children = sourceElement.Elements().ToList();
    List<XElement> childClones = children.Select(el => new XElement(el)).ToList();

    // Clone the section properties nodes
    List<XElement> sections = body.Descendants(W.sectPr).ToList();
    List<XElement> sectionsClones = sections.Select(el => new XElement(el)).ToList();

    // clear body
    body.Descendants().Remove();

    // add source elements to body
    foreach (var child in childClones)
    {
        body.Add(child);
    }

    // add section properties to body
    foreach (var section in sectionsClones)
    {
        body.Add(section);
    }

    // get text from alternate content if needed - either choice or fallback node
    XElement alternate = body.Descendants(MC.AlternateContent).FirstOrDefault();
    if (alternate != null)
    {
        var choice = alternate.Descendants(MC.Choice).FirstOrDefault();
        var fallback = alternate.Descendants(MC.Fallback).FirstOrDefault();
        if (choice != null)
        {
            var choiceChildren = choice.Elements();
            foreach(var choiceChild in choiceChildren)
            {
                body.Add(choiceChild);
            }
        }
        else if (fallback != null)
        {
            var fallbackChildren = fallback.Elements();
            foreach (var fallbackChild in fallbackChildren)
            {
                body.Add(fallbackChild);
            }
        }
    }
}

You could add similar methods to handle the Word document footer.

In my case, I then convert the HTML files to images (using Net-Core-Html-To-Image, also based on wkHtmlToX). I combine the header and body images together (using Magick.NET-Q16-AnyCpu), placing the header image at the top of the body image.

Smart In Media · Answer 6 · 2019-09-26T11:37:20.873

For converting DOCX to PDF, even with placeholders, I created a free library, "Report-From-DocX-HTML-To-PDF-Converter", for .NET Core under the MIT license, because I was frustrated that no simple solution existed and all the commercial solutions were very expensive. You can find it here, with an extensive description and an example project:

https://github.com/smartinmedia/Net-Core-DocX-HTML-To-PDF-Converter

You only need the free LibreOffice. I recommend the LibreOffice portable edition, so that it does not change anything in your server settings. Check where the file "soffice.exe" is located (on Linux it is called differently), because you need its path to fill the variable "locationOfLibreOfficeSoffice".

Here is how to convert from DOCX to PDF or HTML:

string locationOfLibreOfficeSoffice = @"C:\PortableApps\LibreOfficePortable\App\libreoffice\program\soffice.exe";

var docxLocation = "MyWordDocument.docx";

var test = new ReportGenerator(locationOfLibreOfficeSoffice);

//Convert from DOCX to PDF
test.Convert(docxLocation, Path.Combine(Path.GetDirectoryName(docxLocation), "Test-Template-out.pdf"));

//Convert from DOCX to HTML
test.Convert(docxLocation, Path.Combine(Path.GetDirectoryName(docxLocation), "Test-Template-out.html"));

As you can see, you can also convert from DOCX to HTML. You can also put placeholders into the Word document and then "fill" them with values. That is outside the scope of your question, but you can read about it in the README on GitHub.
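If you only need the plain conversion, LibreOffice can also be called directly in headless mode. The following is a minimal sketch, not the library's implementation (I have not checked how the library invokes LibreOffice internally). The class and method names (SofficeConverter, BuildArguments, Convert) and the timeout value are made up for illustration; the command-line switches --headless, --convert-to and --outdir are standard soffice options. Paths that themselves contain double quotes are not handled.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class SofficeConverter
{
    // Builds the soffice command line for a headless conversion.
    // Hypothetical helper: the name and shape are illustrative only.
    public static string BuildArguments(string inputPath, string outputDirectory, string targetFormat = "pdf")
    {
        return $"--headless --convert-to {targetFormat} --outdir \"{outputDirectory}\" \"{inputPath}\"";
    }

    // Starts soffice, waits for it to finish and returns the path of the converted file.
    public static string Convert(string sofficePath, string inputPath, string outputDirectory,
                                 string targetFormat = "pdf", int timeoutMs = 60000)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = sofficePath,
            Arguments = BuildArguments(inputPath, outputDirectory, targetFormat),
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var process = Process.Start(startInfo))
        {
            if (process == null)
                throw new InvalidOperationException("Could not start soffice.");

            if (!process.WaitForExit(timeoutMs))
            {
                process.Kill();
                throw new TimeoutException("LibreOffice did not finish within the timeout.");
            }
        }

        // LibreOffice writes <input file name>.<target format> into the output directory.
        string outputPath = Path.Combine(
            outputDirectory,
            Path.GetFileNameWithoutExtension(inputPath) + "." + targetFormat);

        // Check for the output file rather than relying on the exit code alone.
        if (!File.Exists(outputPath))
            throw new FileNotFoundException("LibreOffice did not produce the expected file.", outputPath);

        return outputPath;
    }
}
```

Usage would look like SofficeConverter.Convert(locationOfLibreOfficeSoffice, "MyWordDocument.docx", outputFolder). Each call starts a new LibreOffice process, so it is worth measuring on your own hardware before relying on it under load.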

I have a couple of questions: 1. Are there any known issues when used in production for avg load of 10 docx to pdf conversions per minute? 2. The portable libreoffice is about 1 GB. Can you indicate which folders / files can be removed to make it lighter without affecting the functionality? — Ravi M Patel, Apr 14 '20 at 12:40
It is also taking more than 10 seconds to do the conversion. Is it normal? — Ravi M Patel, Apr 14 '20 at 14:32

score 0 · Answer 7 · answered Apr 26 '21 at 13:07

An alternative solution is possible if you have access to Office 365. It has fewer limitations than my previous answer, but it requires that purchase.

I get a Graph API token, the ID of the site I want to work with, and the ID of the drive I want to use.

After that I load the bytes of the .docx into a stream:
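For the client itself, a minimal sketch, assuming the access token has already been obtained elsewhere (how you acquire it depends on your Azure AD setup and is not shown). GraphHttpClientFactory is a hypothetical helper name, not part of any SDK; it only attaches the token as a bearer Authorization header, which is how the Graph REST endpoints expect it.

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;

public static class GraphHttpClientFactory
{
    // Creates an HttpClient that sends the given token as "Authorization: Bearer <token>".
    // Assumption: accessToken is a valid Graph API token acquired elsewhere.
    public static HttpClient Create(string accessToken)
    {
        if (string.IsNullOrWhiteSpace(accessToken))
            throw new ArgumentException("An access token is required.", nameof(accessToken));

        var client = new HttpClient();
        client.DefaultRequestHeaders.Authorization =
            new AuthenticationHeaderValue("Bearer", accessToken);
        return client;
    }
}
```

In a long-running application you would normally reuse one client instance rather than creating one per request, and replace the header once the token expires.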

    public static async Task<MemoryStream> GetByteArrayOfDocumentAsync(string baseFilePathLocation)
    {
        var byteArray = await File.ReadAllBytesAsync(baseFilePathLocation);

        // No "using" here: the caller owns the returned stream and must dispose it.
        // Returning MemoryStream matches the parameter type of UploadFileAsync.
        return new MemoryStream(byteArray);
    }

This stream is then uploaded to the Graph API, using a client set up with our Graph API token:

    public static async Task<string> UploadFileAsync(HttpClient client,
                                                     string siteId,
                                                     MemoryStream stream,
                                                     string driveId,
                                                     string fileName,
                                                     string folderName = "root")
    {
        // PUT the file bytes into the target folder; the response body describes the created item
        var result = await client.PutAsync(
            $"https://graph.microsoft.com/v1.0/sites/{siteId}/drives/{driveId}/items/{folderName}:/{fileName}:/content",
            new ByteArrayContent(stream.ToArray()));
        // Fail early on an HTTP error instead of trying to deserialize an error body
        result.EnsureSuccessStatusCode();
        var res = JsonSerializer.Deserialize<SharepointDocument>(await result.Content.ReadAsStringAsync());
        return res.id;
    }

We then request the same item from the Graph API, using the ID it returned, with format=pdf to download it as a PDF:

    public static async Task<Stream> GetPdfOfDocumentAsync(HttpClient client,
                                                           string siteId,
                                                           string driveId,
                                                           string documentId)
    {
        var getRequest =
            await client.GetAsync(
                $"https://graph.microsoft.com/v1.0/sites/{siteId}/drives/{driveId}/items/{documentId}/content?format=pdf");
        return await getRequest.Content.ReadAsStreamAsync();
    }

This returns a stream containing the PDF version of the document that was just uploaded.
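What you do with that stream is up to you. As a minimal sketch, assuming you simply want the PDF on disk, the helper below copies any readable stream to a file. PdfStreamSaver and SaveToFileAsync are hypothetical names, not part of the code above.

```csharp
using System.IO;
using System.Threading.Tasks;

public static class PdfStreamSaver
{
    // Copies the stream (for example the one returned by GetPdfOfDocumentAsync)
    // to the given path, overwriting any existing file.
    public static async Task SaveToFileAsync(Stream pdfStream, string outputPath)
    {
        using (var file = File.Create(outputPath))
        {
            await pdfStream.CopyToAsync(file);
        }
    }
}
```

In a web application you could return the stream to the browser directly instead of writing it to disk first.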

score -3 · Answer 8 · answered Oct 27 '19 at 19:05

I know this can be done using Microsoft.Office.Interop.Word, but my application is .NET Core and does not have access to Office interop.

Maybe this is not true? You CAN load assemblies in .NET Core. However, loading interop components may be a challenge, since .NET Core is host-agnostic.

Here is the thing, though: you don't need to install Office to obtain the Office primary interop assemblies (PIAs). You can try loading the assemblies without using COM+, though this may be a bit tricky. I'm actually not sure whether this can be done, but I think in theory you should be able to do it. Has anyone tried this without installing Office?

Here is the link to the Office PIA download: https://www.microsoft.com/en-us/download/confirmation.aspx?id=3508

truedeity

Did you try what you are recommending? It seems to me that you are answering with a question. If you want your answer to matter, please try it and evaluate the result; this will benefit everybody. Basically, your answer as I read it is more of a question, like "can interop components be loaded in dotnet core?", which in itself is a good question; see http://joelleach.net/2018/06/06/com-interop-with-net-core-2-0/ – PilouPili Oct 27 '19 at 19:24
Thanks for the additional questions, but I already know the answers: No, it is true, You can't load assemblies not on the machine (or in your distribution). Yes, forget about interop. I know, those assemblies aren't available either. Loading via COM+ is not a _bit tricky_, it's impossible in the context I outlined in the question. Yes, in theory you could set up a virtual instance of Windows that runs all the COM stuff and access that, but that's a much more expensive set up. Yes, we've thought about doing this without installing Office, that's the question. – Keith Oct 28 '19 at 08:52