16

I have a few Word files, each with specific content. I would like a snippet that shows me, or helps me figure out, how to combine these Word files into one file using the python-docx library.

For example, with the pywin32 library I did the following:

rng = self.doc.Range(0, 0)
for d in data:
    time.sleep(0.05)

    docstart = d.wordDoc.Content.Start
    self.word.Visible = True
    docend = d.wordDoc.Content.End - 1
    location = d.wordDoc.Range(docstart, docend).Copy()
    rng.Paste()
    rng.Collapse(0)
    rng.InsertBreak(win32.constants.wdPageBreak)

But I need to do this using the python-docx library instead of win32com.client.

SiHa
omri_saadon
  • I rewrote the question @abarnert – omri_saadon Jul 21 '14 at 19:08
  • 2
    The question as re-written looks very answerable. Thank you @omri_saadon – Adam Smith Jul 21 '14 at 19:09
  • 1
    @AdamSmith: Answerable, yes, but now he's asking us to port his code from one library to another, which still isn't appropriate for SO. Especially since he hasn't shown any of his docx code, or described how far he's gotten and where he's stuck except in the vaguest terms. – abarnert Jul 21 '14 at 19:26
  • I don't know how to do it. My idea was to run over each document (its paragraphs and tables) and copy it somehow to the new Word file. Even a general idea of how to do it would be more than welcome; I have only been using this library for a few days. @abarnert – omri_saadon Jul 21 '14 at 20:04

7 Answers

18

An alternative approach that merges two documents, including all their styles, is to use the Python library docxcompose (https://pypi.org/project/docxcompose/). There is no need to define the styling explicitly or to read the document paragraph by paragraph and append it to the master document. Its usage is shown in the code below:

# Importing the required packages
from docxcompose.composer import Composer
from docx import Document as Document_compose

# filename_master is the name of the file you want to merge the docx file into
master = Document_compose(filename_master)
composer = Composer(master)

# filename_second_docx is the name of the second docx file
doc2 = Document_compose(filename_second_docx)

# Append doc2 to the master using composer.append
composer.append(doc2)

# Save the combined docx under a new name
composer.save("combined.docx")

If you want to merge multiple documents into one docx file, you can use the function below:


# filename_master is the name of the file you want to merge all the documents into
# files_list is a list of the filenames of the docx files to be merged
def combine_all_docx(filename_master, files_list):
    master = Document_compose(filename_master)
    composer = Composer(master)
    for filename in files_list:
        doc_temp = Document_compose(filename)
        composer.append(doc_temp)
    composer.save("combined_file.docx")

# For example:
# filename_master = "file1.docx"
# files_list = ["file2.docx", "file3.docx", "file4.docx", "file5.docx"]
# Calling the function:
# combine_all_docx(filename_master, files_list)
# This appends every document in files_list to file1.docx and saves the merged document as combined_file.docx
17

I've adjusted the example above to work with the latest version of python-docx (0.8.6 at the time of writing). Note that this just copies the elements (merging styles of elements is more complicated to do):

from docx import Document

files = ['file1.docx', 'file2.docx']

def combine_word_documents(files):
    merged_document = Document()

    for index, file in enumerate(files):
        sub_doc = Document(file)

        # Don't add a page break if you've reached the last file.
        if index < len(files)-1:
            sub_doc.add_page_break()

        for element in sub_doc.element.body:
            merged_document.element.body.append(element)

    merged_document.save('merged.docx')

combine_word_documents(files)
maerteijn
  • 1
    Yes, but still relevant :) – maerteijn Nov 08 '16 at 15:57
  • 2
    This was very useful, thanks. In my case I had a lot of custom styling to deal with (but which was the same for all documents), so I found it easier to use the first document in the list as the `merged_document` and then append all the others to it. That way there are no conflicts in styling with the default template, which is what `Document()` uses by default. – Mr Kriss Jun 14 '17 at 12:19
  • Good solution. Note that `append` in `python-docx` logic means CUT and paste, not COPY and paste, so if you're using a modified version of the above that accepts an existing Document (as I was), then you need to first save the doc to a temp path, instantiate a new Document from that path, and clean up the temp path afterwards (best way I could find to clone a Document). – Luke Sawczak Sep 15 '20 at 23:44
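
The cut-and-paste behaviour described in the last comment can probably also be avoided by appending deep copies of the elements rather than the elements themselves. This is a minimal sketch, not taken from the answers here: `append_copies` is a hypothetical helper, and it assumes the body elements (lxml elements in python-docx) support `copy.deepcopy`. Like the answer above, it copies only the XML elements, so styles, images and other parts referenced from the body are not carried over.

```python
import copy


def append_copies(source_body, target_body):
    """Append a deep copy of every child of source_body to target_body.

    The originals stay attached to source_body, so the source document
    can still be used or saved afterwards.
    """
    for element in source_body:
        target_body.append(copy.deepcopy(element))


# Assumed usage with python-docx (not run here):
#   append_copies(sub_doc.element.body, merged_document.element.body)
```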
5

If your needs are simple, something like this might work:

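from docx import Document

source_document = Document('source.docx')
target_document = Document()

for paragraph in source_document.paragraphs:
    text = paragraph.text
    target_document.add_paragraph(text)

# Example output filename; pick whatever name suits you.
target_document.save('target.docx')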

There are additional things you can do, but that should get you started.

It turns out that copying content from one Word file to another is quite complex in the general case, involving things like reconciling styles present in the source document that may be conflicting in the target document for example. So it's not a feature we're likely to be adding in the next year, say.

scanny
  • Will it also copy the tables? @scanny – omri_saadon Jul 22 '14 at 04:11
  • 1
    No. See this page for some discussion related to that: https://github.com/python-openxml/python-docx/issues/40 – scanny Jul 22 '14 at 05:24
  • 1
    I managed to copy everything to a new docx file, but all the formatting is gone (bold, for example). Is there a way to keep it? – omri_saadon Jul 24 '14 at 14:23
  • 1
    Well, like I said, solving the problem in the general case is complex. You could probably make some progress by going down to the run level and matching bold and italic there. Each paragraph is composed of runs (to a first approximation) and the character formatting lives at the run level. – scanny Jul 25 '14 at 14:27
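
The run-level idea from the last comment could look roughly like this. It is a sketch under assumptions rather than a complete solution: `copy_paragraph_with_runs` is a hypothetical helper name, it relies only on python-docx's `add_paragraph`, `add_run`, `runs`, `bold`, `italic` and `underline`, and it ignores fonts, colours, paragraph styles and tables.

```python
def copy_paragraph_with_runs(source_paragraph, target_document):
    """Append a copy of source_paragraph to target_document, run by run.

    Only bold, italic and underline are carried over.
    """
    new_paragraph = target_document.add_paragraph()
    for run in source_paragraph.runs:
        new_run = new_paragraph.add_run(run.text)
        # These properties are tri-state: True, False, or None (inherit).
        new_run.bold = run.bold
        new_run.italic = run.italic
        new_run.underline = run.underline
    return new_paragraph


# Assumed usage with python-docx (not run here):
#   for paragraph in source_document.paragraphs:
#       copy_paragraph_with_runs(paragraph, target_document)
```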
4

Create an empty document (empty.docx) and add your two documents to this. On each loop of the iteration over the files, add a page break if necessary.

On completion save the new file that contains your two combined files.

from docx import Document

files = ['file1.docx', 'file2.docx']

def combine_word_documents(files):
    combined_document = Document('empty.docx')
    count, number_of_files = 0, len(files)
    for file in files:
        sub_doc = Document(file)

        # Don't add a page break if you've
        # reached the last file.
        if count < number_of_files - 1:
            sub_doc.add_page_break()

        for element in sub_doc._document_part.body._element:
            combined_document._document_part.body._element.append(element)
        count += 1

    combined_document.save('combined_word_documents.docx')

combine_word_documents(files)
John Paul Hayes
  • AttributeError: 'Document' object has no attribute '_document_part' ? – coachcal Sep 14 '16 at 02:45
  • @coachcal `_document_part` is "private" and _should_ not be accessed as API. In any case this is version/implementation dependent. E.g. could be gone with Python3. Try Martijn Jacobs' solution. This looks _very_ similar but doesn't use private members. I just tried that one and it worked (Python 3.5.3). – Adrian W Jun 12 '18 at 22:41
4

If you just need to combine simple documents with text, you can use python-docx as mentioned above.

If you need to merge documents containing hyperlinks, images, lists, bullet points, etc., you can do this by using lxml to combine the document body and all the referenced files, such as:

  • word/styles.xml
  • word/numbering.xml
  • word/media
  • [Content_Types].xml

etc.
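
As a starting point for this approach, it can help to see which of these parts a given file actually contains. A .docx file is a ZIP package, so the standard library is enough for that. This is only a sketch: `list_reference_parts` and the two constants are hypothetical names, and the actual merging of those parts is not shown.

```python
import zipfile

# Parts named in the list above; a real document may contain others.
REFERENCE_PARTS = ("word/styles.xml", "word/numbering.xml", "[Content_Types].xml")
MEDIA_PREFIX = "word/media/"


def list_reference_parts(docx_file):
    """Return the reference parts present in a .docx package, sorted by name.

    docx_file may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_file) as package:
        names = package.namelist()
    found = [name for name in names if name in REFERENCE_PARTS]
    found += [name for name in names if name.startswith(MEDIA_PREFIX)]
    return sorted(found)


# Assumed usage (not run here):
#   print(list_reference_parts("file1.docx"))
```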

yunshi
  • That sounds promising. Could you provide an example how to do this? May be as separate Question + Answer? Thanks a lot. – Adrian W Jun 13 '18 at 08:38
2

This is all very useful. I combined the answers of Martijn Jacobs and Mr Kriss.

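import os

from django.conf import settings  # only needed for the MEDIA_ROOT lines below
from docx import Document


def combine_word_documents(input_files):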
    """
    :param input_files: an iterable with full paths to docs
    :return: a Document object with the merged files
    """
    for filnr, file in enumerate(input_files):
        # in my case the docx templates are in a FileField of Django, add the MEDIA_ROOT, discard the next 2 lines if not appropriate for you. 
        if 'offerte_template' in file:
            file = os.path.join(settings.MEDIA_ROOT, file)

        if filnr == 0:
            merged_document = Document(file)
            merged_document.add_page_break()

        else:
            sub_doc = Document(file)

            # Don't add a page break if you've reached the last file.
            if filnr < len(input_files)-1:
                sub_doc.add_page_break()

            for element in sub_doc.element.body:
                merged_document.element.body.append(element)

    return merged_document
MZA
  • You don't need the `if filnr < len(input_files)-1:` clause if you move `merged_document.add_page_break()` to the beginning of the `else` tree. Then you will insert a page break _before_ each document except the first. – Adrian W Jun 13 '18 at 08:42
  • The header and footer text repeats three times! – Nida Sahar Nov 28 '19 at 14:25
0

Another alternative is the Aspose.Words Cloud SDK for Python. It retains the formatting and styles of the documents according to the ImportFormatMode parameter, which defines whose formatting is used: that of the appended document or that of the destination document. Possible values are KeepSourceFormatting and UseDestinationStyles.

# For complete examples and data files, please go to https://github.com/aspose-words-cloud/aspose-words-cloud-python
import os
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile


# Please get your Client ID and Secret from https://dashboard.aspose.cloud.
client_id='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx'
client_secret='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

words_api = asposewordscloud.WordsApi(client_id,client_secret)
words_api.api_client.configuration.host='https://api.aspose.cloud'


remoteFolder = 'Temp'
localFolder = 'C:/Temp'
localFileName = 'destination.docx'
remoteFileName = 'destination.docx'
localFileName1 = 'source.docx'
remoteFileName1 = 'source.docx'

#upload file
words_api.upload_file(asposewordscloud.models.requests.UploadFileRequest(open(localFolder + '/' + localFileName,'rb'),remoteFolder + '/' + remoteFileName))
words_api.upload_file(asposewordscloud.models.requests.UploadFileRequest(open(localFolder + '/' + localFileName1,'rb'),remoteFolder + '/' + remoteFileName1))

#append Word documents
requestDocumentListDocumentEntries0 = asposewordscloud.DocumentEntry(href=remoteFolder + '/' + remoteFileName1, import_format_mode='KeepSourceFormatting')

requestDocumentListDocumentEntries = [requestDocumentListDocumentEntries0]
requestDocumentList = asposewordscloud.DocumentEntryList(document_entries=requestDocumentListDocumentEntries)
request = asposewordscloud.models.requests.AppendDocumentRequest(name=remoteFileName, document_list=requestDocumentList, folder=remoteFolder, dest_file_name= remoteFolder + '/' + remoteFileName)

result = words_api.append_document(request)

#download file
request_download=asposewordscloud.models.requests.DownloadFileRequest(remoteFolder + '/' + remoteFileName)
response_download = words_api.download_file(request_download)
copyfile(response_download, localFolder + '/' +"Append_output.docx")
Tilal Ahmad